
HUMaaaE [Transcriptome Resequencing Report]

2017/3/9

© 2017 BGI All Rights Reserved


Shenzhen BGI Genomics Co., Ltd. 400-706-6615 © 2017 BGI All Rights Reserved.


Table of Contents

Results
  1 Abstract
  2 Sequencing Reads Filtering
  3 Genome Mapping
  4 Novel Transcript Prediction
  5 SNP and INDEL Detection
  6 RNA editing Detection
  7 Gene Fusion Detection
  8 Differentially Splicing Gene Detection
  9 Gene Expression Analysis
  10 Differentially Expressed Gene Detection
  11 Hierarchical Clustering Analysis of DEG
  12 Gene Ontology Analysis of DEG
  13 Pathway Analysis of DEG
  14 Transcription Factor Prediction of DEG
  15 PPI Analysis of DEG

Methods
  1 Transcriptome Resequencing Study Process
  2 Sequencing Reads Filtering
  3 Genome Mapping
  4 Novel Transcript Prediction
  5 SNP and INDEL Detection
  6 RNA editing Detection
  7 Gene Fusion Detection
  8 Differentially Splicing Gene Detection
  9 Gene Expression Analysis
  10 Differentially Expressed Gene Detection
  11 Hierarchical Clustering Analysis of DEG
  12 Gene Ontology Analysis of DEG
  13 Pathway Analysis of DEG
  14 Transcription Factor Prediction of DEG
  15 PPI Analysis of DEG

Help
  1 FASTQ Format
  2 What is TF
  3 RNA editing format
  4 Gene fusion format
  5 DSG format
  6 Gene expression list format
  7 DEG list format
  8 MA plot
  9 Volcano plot
  10 Cluster list format
  11 VCF format
  12 How to read DEG GO enrichment analysis result
  13 How to read DEG pathway enrichment analysis result

FAQs

References


Results

1 Abstract

In this project, we sequenced 6 samples on the Illumina HiSeq platform, generating about 0.64 Gb of bases per sample on average. After mapping the sequenced reads to the reference genome and reconstructing transcripts, we obtained 10,553 novel transcripts across all samples: 8,865 are previously unknown splicing events of known genes, 313 are novel coding transcripts without any known features, and the remaining 1,375 are long noncoding RNAs.

2 Sequencing Reads Filtering

Sequencing reads that are of low quality, adaptor-polluted, or have a high content of unknown bases (N) should be removed before downstream analyses. Read quality metrics after filtering are shown in Table 1. The distributions of base content and base quality are shown in Figure 1 and Figure 2, respectively.

Table 1 Summary of sequencing reads after filtering.

Sample  Total Raw Reads (M)  Total Clean Reads (M)  Total Clean Bases (Gb)  Clean Reads Q20 (%)  Clean Reads Q30 (%)  Clean Reads Ratio (%)
HBRR1   7.78                 6.97                   0.63                    98.91                95.17                89.68
HBRR2   7.78                 7.12                   0.64                    98.94                95.30                91.59
HBRR3   7.78                 7.16                   0.64                    98.85                95.06                92.09
UHRR1   7.78                 7.07                   0.64                    98.60                94.02                90.89
UHRR2   7.78                 7.18                   0.65                    98.63                93.99                92.32
UHRR3   7.78                 7.12                   0.64                    98.72                94.45                91.47

Q20: the percentage of bases with a quality score greater than 20.
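The Q20 and Q30 columns can be illustrated with a minimal sketch that computes such percentages from FASTQ quality strings. It is not the report's internal software. The function name `quality_rates` is hypothetical, the Phred+64 offset is an assumption based on the `--phred64` option listed for HISAT in Methods, and the sketch counts scores at or above the threshold, which is the usual convention; switch to a strict comparison if the note above is meant literally.

```python
def quality_rates(quality_strings, offset=64, thresholds=(20, 30)):
    """Return {threshold: percentage of bases with Phred score >= threshold}.

    offset=64 assumes Phred+64 encoded quality strings (an assumption);
    use offset=33 for Phred+33 data.
    """
    total = 0
    passed = {t: 0 for t in thresholds}
    for qual in quality_strings:
        for ch in qual:
            score = ord(ch) - offset  # decode one quality character
            total += 1
            for t in thresholds:
                if score >= t:
                    passed[t] += 1
    if total == 0:
        return {t: 0.0 for t in thresholds}
    return {t: 100.0 * passed[t] / total for t in thresholds}


# Two toy quality strings: 'h' decodes to 40, 'T' to 20, 'B' to 2 (Phred+64).
rates = quality_rates(["hhhh", "TTBB"])
print(rates)
```

In the toy input, six of eight bases reach Q20 and four of eight reach Q30.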

Figure 1 Distribution of base composition on clean reads.

The X axis represents base position along reads and the Y axis represents base content percentage. For high-quality sequencing reads, the A (adenine) curve should closely overlap the T (thymine) curve and the G (guanine) curve should closely overlap the C (cytosine) curve, in line with complementary base pairing. The first six base positions are an exception: random hexamer primers are used to synthesize cDNA for Illumina sequencing, which can introduce bias, so the large fluctuations in the first six positions seen in the figure are normal. If something abnormal happens during sequencing, the figure may show an unbalanced composition.

Figure 2 Distribution of base quality on clean reads.

The X axis represents base position along reads and the Y axis represents base quality value. Each dot represents the number of bases with a given quality value at the corresponding position; a darker dot means a larger number of bases. If the percentage of low-quality bases (< 20) is very high, the sequencing quality of the lane is poor.

3 Genome Mapping

After read filtering, we map the clean reads to the reference genome using HISAT [1]. On average, 94.58% of reads are mapped, and the uniformity of the mapping results across samples suggests that the samples are comparable. Mapping details are shown in Table 2.

Table 2 Summary of Genome Mapping

Sample  Total Clean Reads  Total Mapping Ratio  Uniquely Mapping Ratio
HBRR1   6,974,930          94.00%               89.04%
HBRR2   7,123,888          93.88%               88.88%
HBRR3   7,162,676          93.88%               88.86%
UHRR1   7,069,786          95.19%               88.92%
UHRR2   7,180,650          95.37%               88.67%
UHRR3   7,115,506          95.14%               89.09%

Uniquely Mapping: reads that map to only one location in the reference.
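As a quick check of the average quoted for this section, the short sketch below recomputes the mean total mapping ratio from the Table 2 values. The variable names are illustrative only, and the spread between the best and worst sample is added as one simple way to look at the uniformity the text mentions.

```python
# Total mapping ratios (%) copied from Table 2.
total_mapping_ratio = {
    "HBRR1": 94.00,
    "HBRR2": 93.88,
    "HBRR3": 93.88,
    "UHRR1": 95.19,
    "UHRR2": 95.37,
    "UHRR3": 95.14,
}

values = list(total_mapping_ratio.values())
mean_ratio = sum(values) / len(values)  # arithmetic mean over the six samples
spread = max(values) - min(values)      # gap between best and worst sample

print(f"mean = {mean_ratio:.2f}%, spread = {spread:.2f} percentage points")
# prints "mean = 94.58%, spread = 1.49 percentage points"
```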

4 Novel Transcript Prediction

After genome mapping, we use StringTie [2] to reconstruct transcripts. Using the genome annotation, we then identify novel transcripts in our samples with cuffcompare, a tool from Cufflinks [3]. In total, we identify 10,553 novel transcripts; details are shown in Table 3. For the novel transcripts of each sample, see the "Gene Expression Analysis" section.

Table 3 Summary of Novel Transcripts

Total_Novel_Transcript  Coding_Transcript  Noncoding_Transcript  NovelIsoform  NovelGene
10,553                  9,178              1,375                 8,865         313

NovelIsoform: a coding transcript representing a previously unknown splicing event of a known gene.

5 SNP and INDEL Detection

After genome mapping, we use GATK [4] to call SNP and INDEL variants for each sample. Final results are stored in VCF format. The SNP summary is shown in Table 4 and Figure 3. We also generate a user-friendly SNP summary in Excel format, listed as Table 17. The genomic locations of SNPs and INDELs are summarized in Figure 4 and Figure 5.

Table 4 SNP variant type summary.

Sample  A-G    C-T    Transition  A-C    A-T  C-G    G-T    Transversion  Total
HBRR1   8,554  8,455  17,009      1,416  895  2,124  1,528  5,963         22,972
HBRR2   8,717  8,560  17,277      1,481  951  2,150  1,519  6,101         23,378
HBRR3   8,381  8,229  16,610      1,467  870  2,079  1,493  5,909         22,519
UHRR1   8,853  8,453  17,306      1,540  948  2,258  1,494  6,240         23,546
UHRR2   8,459  8,325  16,784      1,472  933  2,167  1,519  6,091         22,875
UHRR3   9,069  8,975  18,044      1,541  972  2,297  1,546  6,356         24,400

Transition: a substitution between two purines or between two pyrimidines. Transversion: a substitution between a purine and a pyrimidine.
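The transition and transversion columns of Table 4 are sums of the per-type counts, and their ratio (often written Ts/Tv) is a common summary statistic. The sketch below recomputes both for two samples; the Ts/Tv ratio itself is not reported in the table and is added here only as an illustration, and the function name is hypothetical.

```python
# Per-type SNP counts copied from Table 4 (two of the six samples).
snp_counts = {
    "HBRR1": {"A-G": 8554, "C-T": 8455, "A-C": 1416, "A-T": 895, "C-G": 2124, "G-T": 1528},
    "UHRR1": {"A-G": 8853, "C-T": 8453, "A-C": 1540, "A-T": 948, "C-G": 2258, "G-T": 1494},
}

TRANSITIONS = ("A-G", "C-T")                  # purine<->purine, pyrimidine<->pyrimidine
TRANSVERSIONS = ("A-C", "A-T", "C-G", "G-T")  # purine<->pyrimidine


def ts_tv(counts):
    """Return (transitions, transversions, Ts/Tv ratio) for one sample."""
    ts = sum(counts[k] for k in TRANSITIONS)
    tv = sum(counts[k] for k in TRANSVERSIONS)
    return ts, tv, ts / tv


for sample, counts in snp_counts.items():
    ts, tv, ratio = ts_tv(counts)
    print(f"{sample}: Ts={ts} Tv={tv} total={ts + tv} Ts/Tv={ratio:.2f}")
```

For HBRR1 this reproduces the table's 17,009 transitions, 5,963 transversions and 22,972 total SNPs.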

Figure 3 SNP variant type distribution.

X axis represents the type of SNP. Y axis represents the number of SNP.


Figure 4 Distribution of SNP location.

Up2k means upstream 2000 bp area of a gene. Down2k means downstream 2000 bp area of a gene.

Figure 5 Distribution of INDEL location.

Up2k means upstream 2000 bp area of a gene. Down2k means downstream 2000 bp area of a gene.

The SNP and INDEL results of each sample, in VCF format, are listed in the tables below (see VCF format in the help page):

Table 5 SNP list of HBRR1
Table 6 SNP list of HBRR2
Table 7 SNP list of HBRR3
Table 8 SNP list of UHRR1
Table 9 SNP list of UHRR2
Table 10 SNP list of UHRR3
Table 11 INDEL list of HBRR1
Table 12 INDEL list of HBRR2
Table 13 INDEL list of HBRR3
Table 14 INDEL list of UHRR1
Table 15 INDEL list of UHRR2
Table 16 INDEL list of UHRR3
Table 17 Summary of population SNP

6 RNA editing Detection

Using the provided DNA SNP information, we detect RNA editing events in these samples. RNA editing statistics are shown in Figure 6, and details are listed in Tables 18 to 23.

Figure 6 Distribution of RNA editing types.

X axis represents editing type. Y axis represents number of editing events.

Table 18 RNA editing list of HBRR1
Table 19 RNA editing list of HBRR2
Table 20 RNA editing list of HBRR3
Table 21 RNA editing list of UHRR1
Table 22 RNA editing list of UHRR2
Table 23 RNA editing list of UHRR3

7 Gene Fusion Detection

After read filtering, we use SOAPfuse [5] to detect gene fusions in each sample, shown in Figure 7.

Figure 7 Gene fusion analysis.

Circos diagram for gene fusion visualization.

Gene fusion details for each sample are listed in the tables below (see Gene fusion format in the help page):

Table 24 Gene fusion list of HBRR1
Table 25 Gene fusion list of HBRR2
Table 26 Gene fusion list of HBRR3
Table 27 Gene fusion list of UHRR1
Table 28 Gene fusion list of UHRR2
Table 29 Gene fusion list of UHRR3

8 Differentially Splicing Gene Detection

After genome mapping, we use rMATS [6] to detect differentially splicing genes (DSGs) between samples. DSGs are regulated by alternative splicing (AS), which allows a variety of different isoforms to be produced from a single gene. Changes in the relative abundance of isoforms, regardless of changes in expression, indicate a splicing-related mechanism. We detect five types of AS event: Skipped Exon (SE), Alternative 5' Splicing Site (A5SS), Alternative 3' Splicing Site (A3SS), Mutually Exclusive Exons (MXE) and Retained Intron (RI). The results for each provided comparison plan are listed in the tables below (see DSG format in the help page), and the Gene Ontology classification is shown in Figure 8.

Figure 8 Gene Ontology classification of DSGs.

X axis represents the Gene Ontology functions. Y axis represents the number of DSGs.

Table 30 A3SS regulated DSG list of HBRR-VS-UHRR
Table 31 A5SS regulated DSG list of HBRR-VS-UHRR
Table 32 MXE regulated DSG list of HBRR-VS-UHRR
Table 33 RI regulated DSG list of HBRR-VS-UHRR
Table 34 SE regulated DSG list of HBRR-VS-UHRR
Table 35 A3SS regulated DSG list of HBRR1-VS-UHRR1
Table 36 A5SS regulated DSG list of HBRR1-VS-UHRR1
Table 37 MXE regulated DSG list of HBRR1-VS-UHRR1
Table 38 RI regulated DSG list of HBRR1-VS-UHRR1
Table 39 SE regulated DSG list of HBRR1-VS-UHRR1

9 Gene Expression Analysis

After novel transcript detection, we merge the novel coding transcripts with the reference transcripts to obtain a complete reference, map the clean reads to it using Bowtie2 [7], and then calculate the gene expression level of each sample with RSEM [8]. The gene expression summary is shown in Table 40, and the gene expression list of each sample is given in the tables below (see Gene expression list format in the help page).

We then calculate the read coverage and the read distribution on each detected transcript, shown in Figure 9 and Figure 10, respectively. We also calculate the Pearson correlation between all samples (Figure 11), perform hierarchical clustering of all samples (Figure 12), perform PCA according to the provided PCA plan (Figure 13), and use a Venn diagram to display the genes expressed in each sample (Figure 14).

Table 40 Summary of gene expression

Sample  Total Clean Reads  Total Mapping Ratio  Uniquely Mapping Ratio  Total Gene Number  Known Gene Number  Novel Gene Number  Total Transcript Number
HBRR1   6,974,930          58.77%               18.78%                  16503              16315              188                23581
HBRR2   7,123,888          58.24%               18.57%                  16517              16330              187                23579
HBRR3   7,162,676          57.54%               18.26%                  16473              16275              198                23492
UHRR1   7,069,786          71.44%               25.12%                  16679              16494              185                24002
UHRR2   7,180,650          70.23%               24.70%                  16727              16534              193                24016
UHRR3   7,115,506          69.35%               24.40%                  16733              16542              191                24066

Uniquely Mapping: reads that map to only one location in the reference.

Figure 9 Reads coverage on transcripts.

X axis represents the reads coverage. Y axis on left represents the percentage of transcripts. Y axis on right represents thedensity of transcripts.

Figure 10 Reads distribution on transcripts.

X axis represents the position along transcripts. Y axis represents the number of reads.


Figure 11 Heatmap of pearson correlation between samples.

Both the X and Y axes represent the samples. Color indicates the Pearson correlation (high: blue, low: white).

Figure 12 Hierarchical clustering between samples.

Samples that cluster closer together have more similar expression profiles.


Figure 13 PCA analysis.

The X axis represents the contribution rate of the first component and the Y axis represents the contribution rate of the second component. Each point represents a sample.

Figure 14 Venn diagram analysis.

Table 41 Expressed gene list of HBRR1
Table 42 Expressed gene list of HBRR2
Table 43 Expressed gene list of HBRR3
Table 44 Expressed gene list of UHRR1
Table 45 Expressed gene list of UHRR2
Table 46 Expressed gene list of UHRR3

10 Differentially Expressed Gene Detection

Based on the gene expression results, we detect differentially expressed genes (DEGs) between samples. The DEG summary is shown in Figure 15. We also show the DEG distribution using an MA plot (see MA plot in the help page) and a volcano plot (see Volcano plot in the help page), shown in Figure 16 and Figure 17, respectively.

Figure 15 Summary of DEGs.

The X axis represents the compared samples and the Y axis represents the number of DEGs. Red represents up-regulated DEGs and blue represents down-regulated DEGs.

Figure 16 MA plot of DEGs.

The X axis represents value A (log2-transformed mean expression level) and the Y axis represents value M (log2-transformed fold change). Red points represent up-regulated DEGs, blue points represent down-regulated DEGs, and black points represent non-DEGs.
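To make the M and A axes concrete, here is a minimal sketch of how the two values can be computed for one gene. It follows the common convention of taking A as the mean of the two log2 expression values; the report's exact formula is not given in this section, so treat that as an assumption. The function name, the pseudocount and the direction of the fold change (second sample over first) are also assumptions.

```python
import math


def ma_values(expr_a, expr_b, pseudocount=1.0):
    """Return (M, A) for one gene measured in two samples.

    M = log2 fold change of sample B over sample A.
    A = mean of the two log2 expression values.
    The pseudocount (an assumption) avoids log2(0) for unexpressed genes.
    """
    log_a = math.log2(expr_a + pseudocount)
    log_b = math.log2(expr_b + pseudocount)
    m = log_b - log_a
    a = (log_a + log_b) / 2
    return m, a


# A gene with expression 3 in sample A and 15 in sample B:
# log2(4) = 2 and log2(16) = 4, so M = 2 and A = 3.
print(ma_values(3, 15))  # prints (2.0, 3.0)
```

A gene with M above zero is plotted in the upper half of the MA plot; whether it is colored as a DEG depends on the significance test, which this sketch does not cover.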


Figure 17 Volcano plot of DEGs.

The X axis represents -log10-transformed significance and the Y axis represents log2-transformed fold change. Red points represent up-regulated DEGs, blue points represent down-regulated DEGs, and black points represent non-DEGs.

DEG lists are given in the tables below (see DEG list format in the help page):

Table 47 DEG list of HBRR-VS-UHRR.DEseq2_Method
Table 48 DEG list of HBRR-VS-UHRR.EBseq_Method
Table 49 DEG list of HBRR-VS-UHRR.NOIseq_Method
Table 50 DEG list of HBRR1-VS-UHRR1.PossionDis_Method
Table 51 DEG list of HBRR2-VS-UHRR2.PossionDis_Method

11 Hierarchical Clustering Analysis of DEG

We perform hierarchical clustering of the DEGs, shown in Figure 18.

Figure 18 Heatmap of hierarchical clustering of DEGs.

The X axis represents the compared samples and the Y axis represents DEGs. Color indicates fold change (high: red, low: blue).


The ordered DEG lists after hierarchical clustering are given in the tables below (see Cluster list format in the help page):

Table 52 Clustering DEGs list of HBRR-VS-UHRR.NOIseq-HBRR-VS-UHRR.EBseq-HBRR-VS-UHRR.DEseq2.inter
Table 53 Clustering DEGs list of HBRR-VS-UHRR.NOIseq-HBRR-VS-UHRR.EBseq-HBRR-VS-UHRR.DEseq2.union
Table 54 Clustering DEGs list of HBRR1-VS-UHRR1.PossionDis-HBRR2-VS-UHRR2.PossionDis.inter
Table 55 Clustering DEGs list of HBRR1-VS-UHRR1.PossionDis-HBRR2-VS-UHRR2.PossionDis.union

12 Gene Ontology Analysis of DEG

We perform Gene Ontology (GO) classification and functional enrichment of the DEGs. GO has three ontologies: molecular function, cellular component and biological process; we perform functional enrichment for each separately. The GO classification results are shown in Figure 19, and the GO functional enrichment results are shown in Figure 20.

Figure 19 GO classification of DEGs.

X axis represents number of DEG. Y axis represents GO term.


Figure 20 GO functional enrichment of DEGs.

Color indicates the q-value (high: yellow, low: red). A lower q-value indicates more significant enrichment.

13 Pathway Analysis of DEG

We perform KEGG pathway classification and functional enrichment of the DEGs. The pathway classification results are shown in Figure 21, and the pathway functional enrichment results are shown in Figure 22.

Figure 21 Pathway classification of DEGs.

X axis represents number of DEG. Y axis represents pathway name.

Figure 22 Pathway functional enrichment of DEGs.

The X axis represents the enrichment factor and the Y axis represents the pathway name. Color indicates the q-value (high: white, low: blue); a lower q-value indicates more significant enrichment. Point size indicates the number of DEGs (larger points: more DEGs).

14 Transcription Factor Prediction of DEG

After DEG detection, we predict the DEGs that encode transcription factors (TFs) when doing plant research. The DEGs that encode TFs in our project are listed in the tables below:

Table 56 TF coding DEGs list of HBRR-VS-UHRR.DEseq2_Method
Table 57 TF coding DEGs list of HBRR-VS-UHRR.EBseq_Method
Table 58 TF coding DEGs list of HBRR-VS-UHRR.NOIseq_Method
Table 59 TF coding DEGs list of HBRR1-VS-UHRR1.PossionDis_Method
Table 60 TF coding DEGs list of HBRR2-VS-UHRR2.PossionDis_Method

15 PPI Analysis of DEG

Based on NCBI's protein interaction database, we perform Protein-Protein Interaction (PPI) analysis of the DEGs. To view the results, open network_en.html in the PPI analysis result folder with a browser, select your DEG list, and enter a DEG identifier to show the PPI network associated with that DEG. We also generate a TXT file that can be read by Cytoscape, an open-source software platform for complex network analysis and visualization.

Methods

1 Transcriptome Resequencing Study Process

After total RNA is extracted and treated with DNase I, oligo(dT) is used to isolate mRNA. The mRNA is mixed with fragmentation buffer and fragmented, and cDNA is then synthesized using the mRNA fragments as templates. Short fragments are purified and dissolved in EB buffer for end repair and single-nucleotide A (adenine) addition, after which they are ligated to adapters. Suitable fragments are selected for PCR amplification. During the QC steps, the Agilent 2100 Bioanalyzer and the ABI StepOnePlus Real-Time PCR System are used to quantify and qualify the sample library. The library is then sequenced using an Illumina HiSeq 4000, or another sequencer when necessary.

Sequencing yields raw reads. We first filter out low-quality reads, adaptor-polluted reads and reads with a high content of unknown bases (N) to obtain clean reads, and then map the clean reads to the reference genome. Novel transcript prediction, SNP and INDEL detection, and differentially splicing gene (DSG) detection are then performed. Once we have the novel transcripts, we merge the coding ones with the reference transcripts to obtain a complete reference and perform gene expression analysis against it. Finally, we detect differentially expressed genes (DEGs) between samples (at least two samples are required) and perform further functional enrichment analysis. A schematic overview of the whole process is shown in Figure 1.


Figure 1 Transcriptome resequencing study process.

Schematic overview of the study process.

2 Sequencing Reads Filtering

Raw reads include low-quality reads, adaptor-polluted reads and reads with a high content of unknown bases (N). These noisy reads should be removed before downstream analyses. We use internal software to filter reads as follows:

1) Remove reads with adaptors;

2) Remove reads in which unknown bases (N) are more than 5%;

3) Remove low-quality reads (a read is defined as low quality if more than 20% of its bases have a quality score lower than 10).

After filtering, the remaining reads are called "Clean Reads" and stored in FASTQ [12] format (see FASTQ Format in the help page).
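The three filtering rules can be sketched as a single predicate over one read. This is not the internal software mentioned above, only an illustration of the stated thresholds. Adaptor detection is assumed to have been done elsewhere and is passed in as a flag, the Phred+64 offset is an assumption based on the `--phred64` option listed for HISAT, and the function and parameter names are hypothetical.

```python
def is_clean_read(seq, qual, has_adaptor=False, offset=64,
                  max_n_frac=0.05, low_q=10, max_low_q_frac=0.20):
    """Return True if a read passes the three filtering rules.

    seq / qual: base string and quality string of equal length.
    has_adaptor: result of an adaptor search done elsewhere (assumption).
    offset: quality encoding offset; 64 assumes Phred+64 (assumption).
    """
    n_bases = len(seq)
    if n_bases == 0 or len(qual) != n_bases:
        return False
    # Rule 1: remove reads with adaptors.
    if has_adaptor:
        return False
    # Rule 2: remove reads in which unknown bases (N) are more than 5%.
    if seq.upper().count("N") / n_bases > max_n_frac:
        return False
    # Rule 3: remove reads in which more than 20% of bases have quality < 10.
    low_quality = sum(1 for ch in qual if ord(ch) - offset < low_q)
    if low_quality / n_bases > max_low_q_frac:
        return False
    return True


good = is_clean_read("ACGT" * 5, "h" * 20)                   # all bases Q40
too_many_n = is_clean_read("NN" + "ACGT" * 4 + "AC", "h" * 20)  # 10% N
low_qual = is_clean_read("ACGT" * 5, "B" * 5 + "h" * 15)     # 25% of bases at Q2
print(good, too_many_n, low_qual)  # prints "True False False"
```

The comparisons are strict, so a read with exactly 5% N or exactly 20% low-quality bases is kept, matching the "more than" wording of the rules.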

3 Genome Mapping

We use HISAT [1] to perform genome mapping. HISAT is a fast and sensitive spliced alignment program for mapping RNA-seq reads, reported by its authors to have accuracy equal to or better than that of other methods. For 20 million simulated 100-bp reads, the HISAT paper [1] gives the distribution of read types shown in Figure 2: about 40% of reads span multiple exons, and HISAT performs very well on this type of read.


Figure 2 Distribution of read types.

(a) Five types of RNA-seq reads: (i) M, exonic read; (ii) 2M_gt_15, junction reads with long, >15-bp anchors in both exons; (iii) 2M_8_15, junction reads with intermediate, 8- to 15-bp anchors; (iv) 2M_1_7, junction reads with short, 1- to 7-bp, anchors; and (v) gt_2M, junction reads spanning more than two exons. (b) Relative proportions of different types of reads in the 20 million 100-bp simulated read data.

Software information:

HISAT:

version: v0.1.6-beta

parameters: --phred64 --sensitive --no-discordant --no-mixed -I 1 -X 1000

4 Novel Transcript Prediction

A novel transcript is a transcript that contains features not present in the reference annotation; it can be either a new isoform of a previously known gene or a transcript without any known features. We use StringTie [2] to reconstruct transcripts and cuffcompare to compare the reconstructed transcripts to the reference annotation, then select transcripts with class codes 'u', 'i', 'o' and 'j' as novel transcripts; class code details are given in Table 1. We then use CPC [13] to predict the coding potential of the novel transcripts and merge the coding novel transcripts with the reference transcripts to obtain a complete reference, on which downstream analysis is based.

Table 1 Explanation of class code

Class_Code Explanation

u Unknown, intergenic transcript.

i A transfrag falling entirely within a reference intron.

o Generic exonic overlap with a reference transcript.

j Potentially novel isoform (fragment): at least one splice junction is shared with a reference transcript.

The complete class code explanation can be found on the Cufflinks website.

Software information:

StringTie:

version: v1.0.4

parameters: -f 0.3 -j 3 -c 5 -g 100 -s 10000 -p 8

CuffCompare:

version: v2.2.1

parameters: -p 12

CPC:

version: v0.9-r2

parameters: default
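The class-code selection described in this section can be sketched as a simple filter over (transcript, class code) pairs. The transcript identifiers below are hypothetical and the parsing of cuffcompare output files is deliberately left out; only the four class codes come from the text.

```python
from collections import Counter

# Class codes treated as novel in this section: 'u', 'i', 'o', 'j'.
NOVEL_CLASS_CODES = frozenset("uioj")


def select_novel(records):
    """Keep (transcript_id, class_code) pairs whose code marks a novel transcript."""
    return [(tid, code) for tid, code in records if code in NOVEL_CLASS_CODES]


# Hypothetical records; '=' is the cuffcompare code for a match to a reference transcript.
records = [("T1", "="), ("T2", "j"), ("T3", "u"), ("T4", "j"), ("T5", "i")]

novel = select_novel(records)
print([tid for tid, _ in novel])
print(Counter(code for _, code in novel))
```

The selected transcripts would then go to the coding potential prediction step, which this sketch does not attempt to reproduce.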

5 SNP and INDEL Detection

With the genome mapping results, we use GATK [4] to call SNPs and INDELs for each sample. After filtering out unreliable sites, we obtain the final SNPs and INDELs in VCF format.

Software information:

GATK:

version: v3.4-0

parameters(call): -allowPotentiallyMisencodedQuals -stand_call_conf 20.0 -stand_emit_conf 20.0

parameters(filter): -window 35 -cluster 3 -filterName FS -filter "FS > 30.0" -filterName QD -filter "QD < 2.0"

website: https://www.broadinstitute.org/gatk
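The two filter expressions listed above flag a site when FS is greater than 30.0 or QD is less than 2.0. The sketch below applies the same two thresholds to hypothetical variant records to show which filter names a site would receive. It does not call GATK and does not model the `-window 35 -cluster 3` SNP cluster filter; the function name, record layout and example values are all assumptions.

```python
def failed_filters(fs, qd, max_fs=30.0, min_qd=2.0):
    """Return the names of the hard filters a site fails (empty list = pass).

    Mirrors the expressions "FS > 30.0" (filter name FS) and
    "QD < 2.0" (filter name QD) from the parameters above.
    """
    names = []
    if fs > max_fs:
        names.append("FS")
    if qd < min_qd:
        names.append("QD")
    return names


# Hypothetical sites with their FS and QD annotation values.
variants = [
    {"id": "var1", "FS": 1.2, "QD": 15.0},
    {"id": "var2", "FS": 45.0, "QD": 12.0},
    {"id": "var3", "FS": 33.0, "QD": 1.1},
]

for v in variants:
    flags = failed_filters(v["FS"], v["QD"])
    print(v["id"], "PASS" if not flags else ";".join(flags))
```

Here var1 passes, var2 fails FS only, and var3 fails both filters.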

6 RNA editing Detection

RNA editing is an integral step in generating the diversity and plasticity of cellular RNA signatures. To find true RNA editing sites, we developed a computational pipeline that carefully controls for false positives while calling RNA editing events from genome and whole-transcriptome data of the same individual. The initially identified SNVs were then filtered by the following steps:

(1) Basic filter. Retain SNVs that meet the following criteria: quality score of consensus genotype >= 20; coverage depth >= 5; repeats (estimated copy number of the flanking sequence in the genome) <= 1.

(2) Read parameter filter. This filter uses read-level parameters: the sequencing quality score, the distance of a potential SNV to the end of the supporting read, and the coverage depth of the SNV. The cutoffs are: distance cutoff = 15 (m, the minimal distance of an SNV site to the ends of its supporting reads); quality score cutoff = 20 (q, the minimal sequencing quality score of the nucleotide corresponding to the SNV); and supporting reads number cutoff = 2 (n, the minimal number of supporting reads that meet the above two cutoffs).

(3) RNA-DNA variants filter. Further, to focus on RNA-DNA variants only, sites whose DNA genotypes are the same as their RNA genotypes were removed.

(4) MES filter. Next, we removed misaligned reads that arise from mapping errors inherent to the mapping algorithm (MES). The MES set was generated as follows: read sequences were simulated based on all human genes (hg18 transcriptome) using MAQ without mutation (-r parameter). After alignment and SNV calling by SOAP2 and SOAPsnp, respectively, the identified SNVs were passed through the above two filters. The resultant collection of SNVs is termed MES and represents an inherently error-prone set of sites that are incorrectly called owing to the nature of the mapping and/or calling algorithms. Any SNVs derived from step 2 that matched the MES were removed.

(5) Strand filter. A strand bias filter was also applied to remove potential strand-specific errors in sequences generated by the Illumina platform. For a particular SNV site, the reads carrying a reference or alternative allele that map to the plus and minus strands of the genome were counted and evaluated using Fisher's exact test. Sites whose reads exhibited strand bias in distribution (P < 0.01), and whose number of supporting reads mapped to either strand is < 2, were discarded.

(6) Multiple types of mismatches filter. Discard SNV candidate sites that display more than one nonreference type (e.g., the reference allele is A and the nonreference alleles are G and T).

(7) Editing degree filter. Finally, polymorphic sites with an extreme degree of variation (100%) were excluded, based on our observation that >90% of sites in the MES exhibited 100% variation. The degree of editing for a particular site was calculated as the ratio of reads supporting the variant allele to the total number of reads covering the site.
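The basic filter (1) and the editing degree filter (7) above can be sketched in a few lines of Python. This is only an illustration of the stated thresholds, not the pipeline's actual code; the `SnvSite` record and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SnvSite:
    """Hypothetical per-site record; field names are illustrative only."""
    consensus_quality: int  # quality score of the consensus genotype
    depth: int              # number of reads covering the site
    copy_number: float      # estimated copy number of the flanking sequence
    variant_reads: int      # reads supporting the variant allele


def passes_basic_filter(site):
    """Step (1): quality >= 20, depth >= 5, copy number <= 1."""
    return (site.consensus_quality >= 20
            and site.depth >= 5
            and site.copy_number <= 1)


def editing_degree(site):
    """Ratio of variant-supporting reads to all reads covering the site."""
    return site.variant_reads / site.depth


def passes_editing_degree_filter(site):
    """Step (7): exclude sites with 100% variation."""
    return editing_degree(site) < 1.0


site = SnvSite(consensus_quality=30, depth=10, copy_number=1.0, variant_reads=4)
print(passes_basic_filter(site), editing_degree(site))
```

A site with 4 variant reads out of 10 has an editing degree of 0.4 and is kept, whereas a site where all 10 reads carry the variant is excluded by step (7).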

7 Gene Fusion Detection

A fusion gene is a hybrid gene formed from two previously separate genes. It can occur as a result of translocation, interstitial deletion, or chromosomal inversion. Figure 3 illustrates these three ways in which a fusion gene forms. Often, fusion genes are oncogenes that cause cancer, and most of them have been found in hematological cancers, sarcomas and prostate cancer. Oncogenic fusion genes may lead to a gene product with a new or different function from the two fusion partners. Alternatively, a proto-oncogene is fused to a strong promoter, and the oncogenic function is thereby activated through upregulation driven by the strong promoter of the upstream fusion partner.

Figure 3 Gene fusion types.

A fusion gene can occur as a result of (A) chromosomal translocation, (B) interstitial deletion, or (C) chromosomal inversion.

The presence of certain chromosomal aberrations and their resulting fusion genes is commonly used in cancer diagnostics to establish a precise diagnosis. Chromosome banding analysis, fluorescence in situ hybridization (FISH), and reverse transcription polymerase chain reaction (RT-PCR) are common methods employed in diagnostic laboratories. These methods all have distinct shortcomings due to the very complex nature of cancer genomes. Recent developments such as high-throughput sequencing and custom DNA microarrays hold promise for more efficient methods.

SOAPfuse [5] is a tool for identifying fusion transcripts from paired-end RNA-Seq data. The tool applies an improved partial exhaustion algorithm to construct a library of fusion junction sequences, which can be used to efficiently identify fusion events, and employs a series of filters to nominate high-confidence fusion transcripts. Compared with other released tools, SOAPfuse is reported [5] to be more accurate, sensitive and efficient for fusion discovery while consuming fewer computing resources. Furthermore, SOAPfuse provides predicted junction sequences of fusion transcripts and schematic diagrams of fusion events, which improve the efficiency of fusion detection and support disease research, especially cancer research. These advantages are significant for clinical molecular typing and the development of new anti-tumor drugs. Figure 4 shows the workflow of SOAPfuse.

Figure 4 SOAPfuse workflow.

8 Differentially Splicing Gene Detection

It is important to distinguish differential isoform relative abundance from differential isoform expression. Changes in the relative abundance of isoforms, regardless of the expression change, indicate a splicing-related mechanism. On the other hand, there can be measurable changes in the expression of isoforms across samples without necessarily changing the relative abundance, which possibly indicates a transcription-related mechanism. We use rMATS [6], a computational tool for detecting differential alternative splicing events from RNA-Seq data, to detect differentially splicing genes (that is, genes with differential isoform relative abundance between samples). It quantifies the inclusion isoform and the skipping isoform, as shown in Figure 5. The statistical model of rMATS calculates the P-value and false discovery rate (FDR) for the difference in the isoform ratio of a gene between two conditions. In our project, genes with FDR <= 0.05 are defined as differentially splicing genes (DSGs).

Figure 5 The different types of alternative splicing events.

The schematic diagrams illustrate the read counts and effective lengths for the different categories of alternative splicing events.

Software information:

rMATS:

version: v3.0.9

parameters: -analysis U -t paired -a 8

website: http://rnaseq-mats.sourceforge.net
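The isoform ratio compared between conditions is usually expressed as an inclusion level: the length-normalized share of reads that support the inclusion isoform. The Python sketch below assumes this commonly used definition, in which read counts are divided by the effective lengths mentioned in the Figure 5 caption. The function name and arguments are hypothetical, and the statistical model that rMATS uses to derive P-values is not reproduced here.

```python
def inclusion_level(inc_count, skip_count, inc_length, skip_length):
    """Length-normalized inclusion ratio in [0, 1].

    inc_count / skip_count: reads supporting the inclusion / skipping isoform.
    inc_length / skip_length: effective lengths of the two isoforms.
    Returns None when no read supports either isoform.
    """
    inc_density = inc_count / inc_length
    skip_density = skip_count / skip_length
    total = inc_density + skip_density
    if total == 0:
        return None
    return inc_density / total


# 30 inclusion reads over an effective length of 2 versus
# 10 skipping reads over an effective length of 1.
print(inclusion_level(30, 10, 2, 1))
```

Normalizing by effective length matters because the inclusion isoform typically offers more positions at which a supporting read can map than the skipping isoform does.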

9 Gene Expression Analysis

We map clean reads to the reference using Bowtie2 [7], and then calculate gene expression levels with RSEM [8]. RSEM is a software package for estimating gene and isoform expression levels from RNA-Seq data. From the mapping result, we calculate read coverage and read distribution on transcripts. For a sample of high quality and sufficient sequencing depth, most transcripts should be entirely covered, and mapped reads should be uniformly distributed along transcripts. Next, we calculate the Pearson correlation between all samples using cor, a function of R, and perform hierarchical clustering of all samples using hclust, a function of R. We then perform PCA on all samples using princomp, a function of R. Finally, we calculate the overlap of expressed genes between samples and show it in a Venn diagram.

Software information:

Bowtie2:

version: v2.1.0

parameters: -q --phred64 --sensitive --dpad 0 --gbar 99999999 --mp 1,1 --np 1 --score-min L,0,-0.1 -I 1 -X 1000 --no-mixed --no-discordant -p 1 -k 200

website: http://bowtie-bio.sourceforge.net/Bowtie2/index.shtml

RSEM:

version: v1.2.12

parameters: default

website: http://deweylab.biostat.wisc.edu/RSEM
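The report computes sample-to-sample Pearson correlation with R's cor. For readers who want to see what that statistic does, the following is a minimal pure-Python sketch of the same calculation. The sample names and expression values are made up for illustration, and the helper names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(samples):
    """Pairwise correlations for a dict of sample name -> expression vector."""
    names = sorted(samples)
    return {(a, b): pearson(samples[a], samples[b]) for a in names for b in names}


# Illustrative expression values for three hypothetical samples.
expression = {
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [2.0, 4.1, 5.9, 8.0],
    "C": [4.0, 3.0, 2.0, 1.0],
}
matrix = correlation_matrix(expression)
print(round(matrix[("A", "B")], 3), round(matrix[("A", "C")], 3))
```

Samples with near-proportional expression profiles correlate close to 1, which is the pattern expected between good replicates.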

10 Differentially Expressed Gene Detection

We detect DEGs with DESeq2, EBSeq, NOISeq and PoissonDis as requested. DESeq2 is based on the negative binomial distribution and is performed as described by Love MI, et al. [14]. EBSeq is based on an empirical Bayesian model and is performed as described by Leng N, et al. [15]. NOISeq is based on a noise distribution model and is performed as described by Tarazona S, et al. [16]. PoissonDis is based on the Poisson distribution and is performed as described by Audic S, et al. [17].

Software information:

DESeq2:

parameters: Fold Change >= 2.00 and Adjusted Pvalue <= 0.05

EBSeq:

parameters: Fold Change >= 2.00 and Posterior Probability of being Equivalent Expression (PPEE) <= 0.05

NOISeq:

parameters: Fold Change >= 2.00 and Probability >= 0.8

PoissonDis:

parameters: Fold Change >= 2.00 and FDR <= 0.001
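As an illustration of how such thresholds translate into an Up/Down call, here is a minimal Python sketch using the DESeq2-style cutoffs listed above (fold change >= 2 and adjusted p-value <= 0.05). It assumes that the fold-change cutoff applies in both directions, and it uses a small expression floor to avoid division by zero. Both the floor value and the function name are assumptions made for this example, not details taken from the pipeline.

```python
import math


def classify_deg(expr_control, expr_treat, padj,
                 min_fold=2.0, max_padj=0.05, floor=0.01):
    """Return 'Up', 'Down', or '*' (non-DEG) for one gene.

    floor is a hypothetical lower bound applied to expression values so
    that the ratio is defined when a gene is not expressed in one sample.
    """
    control = max(expr_control, floor)
    treat = max(expr_treat, floor)
    log2_fold_change = math.log2(treat / control)
    is_significant = padj <= max_padj
    is_large_change = abs(log2_fold_change) >= math.log2(min_fold)
    if is_significant and is_large_change:
        return "Up" if log2_fold_change > 0 else "Down"
    return "*"


print(classify_deg(10.0, 40.0, 0.01))   # four-fold increase, significant
print(classify_deg(40.0, 10.0, 0.01))   # four-fold decrease, significant
print(classify_deg(10.0, 15.0, 0.01))   # change too small
```

The same structure applies to the other three methods by swapping the significance test, for example PPEE <= 0.05 for EBSeq or probability >= 0.8 for NOISeq.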

11 Hierarchical Clustering Analysis of DEG

We perform hierarchical clustering of DEGs using pheatmap, a function of R. When clustering more than two comparison groups, we cluster the intersection and the union of their DEGs, respectively.

12 Gene Ontology Analysis of DEG

With the GO annotation result, we classify DEGs according to the official classification, and we also perform GO functional enrichment using phyper, a function of R. The p-value is calculated with the hypergeometric test; see https://en.wikipedia.org/wiki/Hypergeometric_distribution for details.

We then calculate the false discovery rate (FDR) for each p-value. In general, terms with an FDR no larger than 0.001 are defined as significantly enriched.
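The enrichment p-value from a hypergeometric test is the upper-tail probability of seeing at least m annotated DEGs by chance. The Python sketch below assumes the usual formulation, with N genes in the background, M of them annotated to the term, n DEGs, and m DEGs annotated to the term. The symbol names are chosen for this example and do not come from the report. In R, the same tail is given by phyper(m - 1, M, N - M, n, lower.tail = FALSE).

```python
from math import comb


def enrichment_pvalue(N, n, M, m):
    """P(X >= m) for X ~ Hypergeometric(population N, successes M, draws n).

    N: background genes with annotation
    n: DEGs among those N genes
    M: background genes annotated to the term
    m: DEGs annotated to the term
    """
    upper = min(n, M)
    # Sum the probability mass of every outcome with at least m hits.
    tail = sum(comb(M, i) * comb(N - M, n - i) for i in range(m, upper + 1))
    return tail / comb(N, n)


# Toy example: 10 background genes, 4 in the term, 3 DEGs, 2 of them in the term.
print(enrichment_pvalue(N=10, n=3, M=4, m=2))
```

In the toy example the tail is (6*6 + 4*1) / 120, i.e. one third, so two hits out of three DEGs would not count as enrichment in such a small background.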

13 Pathway Analysis of DEG

With the KEGG annotation result, we classify DEGs according to the official classification, and we also perform pathway functional enrichment using phyper, a function of R. The p-value is calculated with the hypergeometric test; see https://en.wikipedia.org/wiki/Hypergeometric_distribution for details.

We then calculate the false discovery rate (FDR) for each p-value. In general, terms with an FDR no larger than 0.001 are defined as significantly enriched.
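The report does not state which FDR procedure is applied to the enrichment p-values. As an assumption for illustration only, the sketch below uses the widely used Benjamini-Hochberg adjustment, which multiplies each p-value by the number of tests divided by its rank and then enforces monotonicity from the largest p-value downwards.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, keeping a running minimum
    # so that adjusted values never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]
fdr = benjamini_hochberg(raw)
significant = [q <= 0.001 for q in fdr]
print([round(q, 4) for q in fdr], significant)
```

With the 0.001 cutoff used in this report, none of the four toy p-values would be called significantly enriched.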

14 Transcription Factor Prediction of DEG

We use getorf [18] to find the ORF of each DEG, then align the ORFs to TF domains (from PlnTFDB) using hmmsearch [19], and identify TFs according to the rules described by PlnTFDB.

Software information:

getorf:

version: EMBOSS:6.5.7.0

parameters: -minsize 150

website: http://genome.csdb.cn/cgi-bin/emboss/help/getorf

hmmsearch:

version: v3.0

parameters: default

website: http://hmmer.org

15 PPI Analysis of DEG

Different proteins often form protein complexes through complicated interactions to perform their biological functions. Protein-protein interaction (PPI) analysis rests on the idea that DEGs whose proteins form interaction networks may have similar functions. For further network analysis, we generate a TXT file that can be imported into Cytoscape, an open-source software platform for complex network analysis and visualization. For more details about Cytoscape, please visit its website.

Help

1 FASTQ Format

The original image data is converted into sequence data via base calling; the result is defined as raw data or raw reads and saved as a FASTQ file. These FASTQ files are the original data provided to users, including detailed read sequences and read quality information. In each FASTQ file, every read is described by four lines, as follows:

@A80GVTABXX:4:1:2587:1979#ACAGTGAT/1

NTTTGATATGTGTGAGGACGTCTGCAGCGTCACCTTTATCGGCCATGGT

+

BMMTKZXUUUdddddddddddddddddddddddddddadddddd^WYYU

The first and third lines are sequence names generated by the sequence analyzer; the second line is the sequence; the fourth line is the sequencing quality value, in which each character corresponds to the base at the same position in line 2. The base quality equals the ASCII value of the character in line 4 minus 64 (we call this quality system Phred+64); e.g., the ASCII value of c is 99, so its base quality value is 35. Starting from the Illumina GA Pipeline v1.5, base quality values range from 2 to 41. Table 1 shows the relationship between sequencing error rate and sequencing quality value. Specifically, if the sequencing error rate is denoted as E (as a fraction) and the base quality value is denoted as Q, the relationship is given by the following formula:

$$Q = -10 \log_{10} E$$

Table 1 Relationship between sequencing error rate and sequencing quality value

| Sequencing Error Rate (%) | Sequencing Quality Value | Character (Phred+64) | Character (Phred+33) |
|---|---|---|---|
| 1.00 | 20 | T | 5 |
| 0.10 | 30 | ^ | ? |
| 0.01 | 40 | h | I |

More detailed information about the FASTQ format can be found at http://en.wikipedia.org/wiki/FASTQ_format.

Note: The quality system of Illumina HiSeq 2000 (or 2500) is Phred+64, and the quality system of Illumina HiSeq 4000 is Phred+33. For reads sequenced on Illumina HiSeq 4000, we convert the quality system from Phred+33 to Phred+64 for both raw data and clean data, for compatibility with the software used in our study.
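The quality encoding described above is easy to check by hand. The following Python sketch decodes a quality character, converts a quality value to an error probability using the standard Phred relationship E = 10^(-Q/10), and re-encodes a Phred+33 string as Phred+64. The function names are illustrative and are not part of any tool used in the report.

```python
def phred_quality(char, offset=64):
    """Base quality encoded by one FASTQ quality character."""
    return ord(char) - offset


def error_probability(quality):
    """Sequencing error probability for a Phred quality value."""
    return 10 ** (-quality / 10)


def convert_quality(quality_string, src_offset=33, dst_offset=64):
    """Re-encode a quality string from one Phred offset to another."""
    return "".join(chr(ord(c) - src_offset + dst_offset) for c in quality_string)


print(phred_quality("c"))            # Phred+64 character from the example above
print(error_probability(20))         # error rate for Q20
print(convert_quality("5?I"))        # Phred+33 characters for Q20, Q30, Q40
```

Converting "5?I" yields "T^h", matching the Phred+33 and Phred+64 characters listed for Q20, Q30 and Q40 in Table 1.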

2 What is TF

In molecular biology and genetics, a transcription factor (sometimes called a sequence-specific DNA-binding factor) is a protein that binds to specific DNA sequences, thereby controlling the rate of transcription of genetic information from DNA to messenger RNA. Transcription factors perform this function alone or with other proteins in a complex, by promoting (as an activator) or blocking (as a repressor) the recruitment of RNA polymerase (the enzyme that performs the transcription of genetic information from DNA to RNA) to specific genes. See https://en.wikipedia.org/wiki/Transcription_factor for details.

3 RNA editing format

The RNA editing list of each sample is stored in CNS format. See the Output Format section of http://soap.genomics.org.cn/soapsnp.html for details.

4 Gene fusion format

The gene fusion list of each sample is stored in a tab-separated text file. See the Output Files section of http://soap.genomics.org.cn/soapfuse.html for details.

5 DSG format

The Differentially Splicing Gene (DSG) result of each comparison plan is stored in the tab-separated text file Files/BGI_result/3.DifferentiallySplicingGene/*/*.GeneDiffSplice.xls with the format described in Table 2.

Table 2 Format of differentially splicing gene result list.

| Field | Description | Notes |
|---|---|---|
| GeneID | gene identity | - |
| Chr | chromosome | - |
| Strand | strand | - |
| Control-IC | inclusion junction counts for Control sample, replicates are separated by comma | - |
| Control-SC | skipping junction counts for Control sample, replicates are separated by comma | - |
| Treat-IC | inclusion junction counts for Treat sample, replicates are separated by comma | - |
| Treat-SC | skipping junction counts for Treat sample, replicates are separated by comma | - |
| Pvalue | statistical significance | - |
| FDR | false discovery rate | - |
| longExonStart | the long exon start position on chromosome | for A3SS and A5SS event |
| longExonEnd | the long exon end position on chromosome | for A3SS and A5SS event |
| shortExonStart | the short exon start position on chromosome | for A3SS and A5SS event |
| shortExonEnd | the short exon end position on chromosome | for A3SS and A5SS event |
| flankingExonStart | the flanking exon start position on chromosome | for A3SS and A5SS event |
| flankingExonEnd | the flanking exon end position on chromosome | for A3SS and A5SS event |
| 1stExonStart | the first exon start position on chromosome | for MXE event |
| 1stExonEnd | the first exon end position on chromosome | for MXE event |
| 2ndExonStart | the second exon start position on chromosome | for MXE event |
| 2ndExonEnd | the second exon end position on chromosome | for MXE event |
| riExonStart | the intron-retained exon start position on chromosome | for RI event |
| riExonEnd | the intron-retained exon end position on chromosome | for RI event |
| skipExonStart | the skipped exon start position on chromosome | for SE event |
| skipExonEnd | the skipped exon end position on chromosome | for SE event |
| upstreamExonStart | the upstream exon start position on chromosome | for RI and SE event |
| upstreamExonEnd | the upstream exon end position on chromosome | for RI and SE event |
| downstreamExonStart | the downstream exon start position on chromosome | for RI and SE event |
| downstreamExonEnd | the downstream exon end position on chromosome | for RI and SE event |
| LongExonTranscripts | the transcripts that contain long exon, separated by comma | for A3SS and A5SS event |
| ShortExonTranscripts | the transcripts that contain short exon, separated by comma | for A3SS and A5SS event |
| 1stExonTranscripts | the transcripts that contain first exon, separated by comma | for MXE event |
| 2ndExonTranscripts | the transcripts that contain second exon, separated by comma | for MXE event |
| RetainTranscripts | the transcripts that contain intron-retained exon, separated by comma | for RI event |
| AbandonTranscripts | the transcripts that exclude intron-retained exon, separated by comma | for RI event |
| InclusionTranscripts | the transcripts that include certain exon, separated by comma | for SE event |
| SkippingTranscripts | the transcripts that exclude certain exon, separated by comma | for SE event |

6 Gene expression list format

The gene expression result of each sample is stored in the tab-separated text file Files/BGI_result/4.Quantify/GeneExpression/*.gene.fpkm.xls (* represents the sample name) with the format described in Table 3.

Table 3 Format description of gene expression result list.

| Field | Description |
|---|---|
| gene_id | gene ID number |
| transcript_id(s) | transcript list of the gene, separated by comma |
| length | length of the gene as estimated by the model |
| expected_count | number of reads supporting this gene as estimated by the model |
| FPKM | FPKM value of this gene |
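For orientation, FPKM (fragments per kilobase of transcript per million mapped fragments) can be computed from a fragment count, a gene length and the library size using the textbook definition sketched below in Python. This is a simplified illustration: RSEM derives its FPKM values from model-based estimates, so the numbers in the result file may differ from this direct calculation. The function and argument names are hypothetical.

```python
def fpkm(fragment_count, gene_length_bp, total_mapped_fragments):
    """Fragments per kilobase of gene per million mapped fragments."""
    if gene_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (fragments per million).
    return fragment_count * 1e9 / (gene_length_bp * total_mapped_fragments)


# 100 fragments on a 2,000 bp gene in a library of 1,000,000 mapped fragments.
print(fpkm(100, 2000, 1_000_000))
```

The normalization by both gene length and library size is what makes FPKM values comparable between genes of different lengths and between samples sequenced to different depths.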

7 DEG list format

The differentially expressed genes for each control-treatment pair are stored in the tab-separated text file Files/BGI_result/5.Quantify/DifferentExpressedGene/*.GeneDiffExpFilter.xls (* represents the pair name) with the format described in Table 4.

Table 4 Format description of DEGs screening result file.

| Field | Description |
|---|---|
| Unigene | Unigene ID |
| Length | Unigene length |
| Sample1-Expression | Unigene expression of control sample(s) |
| Sample2-Expression | Unigene expression of treat sample(s) |
| log2FoldChange(Sample2/Sample1) | log2-transformed fold change between control and treat samples |
| Pvalue | p-value (PoissonDis or DESeq2 method used) |
| FDR | false discovery rate (PoissonDis method used) |
| Padj | adjusted p-value (DESeq2 method used) |
| PPEE | posterior probability of being equivalent expression (EBSeq method used) |
| Probability | probability of being a DEG (NOISeq method used) |
| Up/Down-Regulation(Sample2/Sample1) | flag indicating an up-regulated DEG (Up), a down-regulated DEG (Down) or a non-DEG (*) |

8 MA plot

The MA plot shows the distribution of the red/green intensity ratio ('M') plotted against the average intensity ('A'). M and A are defined by the following equations:

$$M = \log_2(R/G) = \log_2 R - \log_2 G$$

$$A = \frac{1}{2}\log_2(RG) = \frac{1}{2}(\log_2 R + \log_2 G)$$

where R and G are the red and green intensities, respectively. See https://en.wikipedia.org/wiki/MA_plot for details.

9 Volcano plot

The volcano plot is a type of scatter plot that is used to quickly identify changes in large datasets. It plots significance versus fold change on the y- and x-axes, respectively. See https://en.wikipedia.org/wiki/Volcano_plot_(statistics) for details.

10 Cluster list format

The format of the cluster list is described in Table 5.

Table 5 Format description of DEGs clustering list.

| Field | Description |
|---|---|
| Unigene | Unigene ID |
| A-VS-B | log2FoldChange of A-VS-B |
| C-VS-D | log2FoldChange of C-VS-D |
| ... | ... |

11 VCF format

Variant Call Format (VCF) is a flexible and extendable format for variation data such as single nucleotide variants, insertions/deletions, copy number variants and structural variants. See details at the UCSC website: http://genome.ucsc.edu/FAQ/FAQformat.html#format10.1

12 How to read DEG GO enrichment analysis result

Make sure that Java is installed on the computer and use the IE browser to open GOView.html. The left navigation pane lists three types of GO terms for each control-treatment pair (C: cellular component, P: biological process, F: molecular function). Click one of them, and the enriched GO terms will be listed as in Figure 3.

Figure 3 Significantly enriched GO terms in DEGs.

Column 1 is the GO term name. Column 2 is the ratio of DEGs enriched to this GO term. Column 3 is the ratio of genes enriched to this GO term in the background database. Column 4 is the corrected P-value, which indicates the degree of enrichment: the smaller the corrected P-value, the more significantly DEGs are enriched to this GO term. The result list is sorted by corrected P-value. Column 5 is the clustering of fold-change values for these enriched DEGs, produced with the tools cluster [9] [10] and javaTreeView [11].

Click the term name 'BLOC complex' in Figure 3 to go to http://amigo.geneontology.org/amigo for more information (requires an Internet connection). Click 'view genes' in Figure 3 to get the gene IDs enriched to this GO term, as in Figure 4.

Figure 4 Gene ID list related to GO terms.

There are two DEGs enriched to the term 'BLOC complex': 63915, 100526837.

13 How to read DEG pathway enrichment analysis result

Open the HTML report for the pathway enrichment result, and the enriched KEGG pathways will be listed as in Figure 5.

Figure 5 Pathway enrichment analysis of DEGs.

Column 1 is the ordinal number. Column 2 is the pathway name. Column 3 is the ratio of DEGs enriched to this pathway. Column 4 is the ratio of genes enriched to this pathway in the background database. Pvalue and Qvalue both indicate the degree of enrichment, and Qvalue is the corrected Pvalue. The smaller they are, the more significantly DEGs are enriched to this pathway. The result list is sorted by Qvalue. The last column is the pathway ID corresponding to the pathway name.

Click the pathway name 'Leukocyte transendothelial migration' in Figure 5 to get the gene IDs enriched to it, as in Figure 6.

Figure 6 Gene ID list related to pathway.

There are 46 DEGs enriched to the pathway 'Leukocyte transendothelial migration'.

Furthermore, to examine the most significantly enriched pathways, the pathway enrichment result lets us view detailed pathway information in the KEGG database. For example, clicking the hyperlink on 'Leukocyte transendothelial migration' in Figure 6 brings up the detailed information shown in Figure 7.

Figure 7 An example of KEGG pathway of 'Leukocyte transendothelial migration'.

Up-regulated genes are marked with red borders and down-regulated genes with green borders. Unchanged genes are marked with black borders. When the mouse hovers over a red or green border, the related DEGs appear at the top left. Clicking a gene name in the figure redirects to the KEGG website if the computer is connected to the Internet.

FAQs

How should the figure in the randomness analysis be understood? What is the criterion for randomness?
Randomness is one of the criteria for sequencing quality. At present, there is no fixed criterion for evaluating randomness. Generally speaking, if randomness is good, the reads are evenly distributed along the reference sequence.

How can '.fq' files be viewed?
In a Linux or Unix environment, '.fq' files can be opened with the command 'less'. In a Windows environment, you have to unpack the files first, then use the UltraEdit software (downloadable online) to view them. We recommend using a Linux or Unix environment.

In the Gene Expression Difference Analysis, for example A-VS-B, how should up-regulated and down-regulated be understood?
A-VS-B means that sample A is the control and sample B is the case. In the corresponding files A-VS-B.GeneDiffExp.xls and A-VS-B.GeneDiffExpFilter.xls, a gene marked as up-regulated has higher expression in sample B than in sample A.

In the figure of the pathway enrichment analysis, why is the number of genes not equal to the number of colored borders?
Because each border in the figure represents one kind of enzyme, and an enzyme's function may involve several genes, one border may be related to many genes.

What are the '.gff' and '.psl' file formats?
Please refer to the format documentation web page, which includes detailed descriptions of these formats.

In this experiment the unique match rate is about 40%, but papers report that this rate can reach 90%. What is the reason for this?
This rate mainly depends on the reference sequence. For model organisms, this rate is about 50%~60%. For non-model organisms, a unique match rate of up to 40% is a good result.

When preparing the library, why do you use RNA fragmentation instead of cDNA fragmentation?
Please refer to the reference paper 'RNA-Seq: a revolutionary tool for transcriptomics'.

How is mRNA extracted from total RNA?
Eukaryotic mRNA is enriched using oligo(dT) magnetic beads, and prokaryotic mRNA is enriched simply by removing rRNA from the total RNA.

Why use Bowtie to map reads to reference genes? Did you compare SOAPaligner/SOAP2 with other alignment tools?
Bowtie performs better at the transcript level than other tools. In addition, its mapping result can be used to calculate expression through RSEM, which produces results that correlate highly with standard qPCR.

References

[1] Kim D, et al. (2015). HISAT: a fast spliced aligner with low memory requirements. Nature Methods 2015.
[2] Pertea M, et al. (2015). StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nature Biotechnology 2015.
[3] Trapnell C, et al. (2012). Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protocols 7, 562-578.
[4] McKenna A, et al. (2010). The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010 Sep;20(9):1297-303.
[5] Jia W, et al. (2013). SOAPfuse: an algorithm for identifying fusion transcripts from paired-end RNA-Seq data. Genome Biology 2013, 14:R12.
[6] Shen S, et al. (2014). rMATS: Robust and flexible detection of differential alternative splicing from replicate RNA-Seq data. PNAS, 111(51):E5593-601.
[7] Langmead B, et al. (2012). Fast gapped-read alignment with Bowtie 2. Nature Methods. 2012, 9:357-359.
[8] Li B, et al. (2011). RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics. 2011 Aug 4;12:323.
[9] Eisen MB, et al. (1998). Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA, 95(25): 14863-8.
[10] de Hoon MJL, et al. (2004). Open Source Clustering Software. Bioinformatics, 20(9): 1453-1454.
[11] Saldanha AJ (2004). Java Treeview--extensible visualization of microarray data. Bioinformatics, 20(17): 3246-8.
[12] Cock P, et al. (2010). The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Research, 38(6): 1767-1771.
[13] Kong L, et al. (2007). CPC: assess the protein-coding potential of transcripts using sequence features and support vector machine. Nucleic Acids Res. 2007 Jul; 35(Web Server issue): W345-W349.
[14] Love MI, et al. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15, pp. 550.
[15] Leng N, et al. (2013). EBSeq: An R package for gene and isoform differential expression analysis of RNA-seq data. Bioinformatics. 2013 Apr 15; 29(8): 1035-1043.
[16] Tarazona S, et al. (2011). Differential expression in RNA-seq: a matter of depth. Genome Research, 21(12), pp. 2213-2223.
[17] Audic S, et al. (1997). The significance of digital gene expression profiles. Genome Res. 1997 Oct;7(10):986-95.
[18] Rice P, et al. (2000). EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet. 2000 Jun;16(6):276-7.
[19] Mistry J, et al. (2013). Challenges in homology search: HMMER3 and convergent evolution of coiled-coil regions. Nucleic Acids Res. 2013 Jul;41(12):e121.

© 2017 BGI. All Rights Reserved. Guangdong ICP Filing No. 12059600. Technical support e-mail: [email protected]. Website: www.bgitechsolutions.com

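Contact Us

Service hotline: 400-706-6615
E-mail: [email protected]

Website: www.bgitechsolutions.com
Address: Building 11, Beishan Industrial Zone, Yantian District, Shenzhen, Guangdong Province (postal code: 518083)

This final report is provided solely for the customer's study, exchange and research; it must not be used for commercial purposes, and violations will be pursued. Copyright notice: the copyright of this final report belongs to BGI Genomics Co., Ltd. (深圳华大基因股份有限公司). Without the company's written permission, no other individual or organization may reproduce, copy, edit, or translate into other languages any content of this report in any form. All trademarks and logos in this report belong to BGI Genomics Co., Ltd. and their providers. January 2017

Genetic technology for the benefit of humankind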