Computer Skills for Biology Research, Lecture 10

Computational Skill for Modern Biology Research

Department of Biology, Chungbuk National University

10th Lecture 2015.11.17

NGS Analysis III : RNA quantification with kallisto & DE

Syllabus

Week  Topic

Week 1: Introduction: Why do we need to learn this stuff?

Week 2: Basics of Unix and running BLAST on your PC

Week 3: Unix Command Prompt II and shell scripts

Week 4: Basics of programming (Python programming)

Week 5: Python Scripting II and sequence manipulations

Week 6: IPython Notebook and Pandas

Week 7: Basics of Next Generation Sequencing and Tutorial

Week 8

Week 9: Next Generation Sequencing Analysis I

Week 10: Next Generation Sequencing Analysis II

Week 11: Next Generation Sequencing Analysis III

Week 12: Bioconductor I

Week 13: Bioconductor II

Week 14: Network analysis

Conventional RNA-Seq Analysis

Mapping sequencing reads to the reference genome

Read quantification

Calculation of FPKM

Differential Expression Analysis

Bottleneck

Too time-consuming

30 million paired-end reads. All processing was done using 20 cores, with programs run with 20 threads.

http://arxiv.org/pdf/1505.02710v2.pdf

Even on a 20-core CPU server, it took a serious amount of time.

A more efficient way to quantify the transcriptome is needed.

kallisto: Near-optimal RNA-Seq quantification

https://pachterlab.github.io/kallisto/

http://arxiv.org/abs/1505.02710

Do we really need to align RNA sequencing reads to the genome?

The transcriptome is usually far smaller than the genome

Sometimes we only need to know which reads correspond to which isoforms
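The idea of matching reads to isoforms without a full alignment can be sketched with a toy k-mer lookup. This is only a simplified illustration, not kallisto's actual implementation: the transcript names, sequences, and the functions `build_kmer_index` and `pseudoalign` are hypothetical, and k = 5 is used only to keep the example small.

```python
from collections import defaultdict


def build_kmer_index(transcripts, k):
    """Map every k-mer to the set of transcripts that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def pseudoalign(read, index, k):
    """Return the transcripts compatible with all indexed k-mers of the read."""
    compatible = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k])
        if not hits:
            continue  # k-mer not in the index (for example, a sequencing error)
        compatible = set(hits) if compatible is None else compatible & hits
    return compatible or set()


# Two made-up isoforms that share their first 8 bases.
toy_transcripts = {
    "isoform_1": "ACGTACGGTTCA",
    "isoform_2": "ACGTACGGAAAC",
}
toy_index = build_kmer_index(toy_transcripts, k=5)

shared_read = pseudoalign("GTACGG", toy_index, k=5)    # from the shared region
specific_read = pseudoalign("CGGTTC", toy_index, k=5)  # only in isoform_1
print(sorted(shared_read), sorted(specific_read))
```

A read from the shared region stays compatible with both isoforms, while a read covering the isoform-specific part narrows down to one. No base-by-base alignment is needed for either answer.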

Download and install

https://pachterlab.github.io/kallisto/download.html

cd ~
wget https://github.com/pachterlab/kallisto/releases/download/v0.42.4/kallisto_mac-v0.42.4.tar.gz
tar -xvzf kallisto_mac-v0.42.4.tar.gz

Then add the kallisto directory to PATH (in ~/.bash_profile)

kallisto 0.42.4

Usage: kallisto <CMD> [arguments] ..

Where <CMD> can be one of:

index     Builds a kallisto index
quant     Runs the quantification algorithm
h5dump    Converts HDF5-formatted results to plaintext
version   Prints version information

Transcriptome index

You need to generate an index for the transcriptome

http://bio.math.berkeley.edu/kallisto/transcriptomes/

The transcriptome is just a FASTA file containing all of the mRNA (cDNA) sequences annotated in your genome.

Download mouse transcriptome

http://bio.math.berkeley.edu/kallisto/transcriptomes/Mus_musculus.GRCm38.rel79.cdna.all.fa.gz

Generate index

kallisto index -i mouse.idx Mus_musculus.GRCm38.rel79.cdna.all.fa.gz

-i mouse.idx: index name; Mus_musculus.GRCm38.rel79.cdna.all.fa.gz: transcriptome FASTA file

Now it is time to download some NGS data from the SRA (Sequence Read Archive).

http://sra.dnanexus.com

Input Keywords

Restrict the search type to ‘Transcriptome Analysis’

Select the runs and download the SRA URLs

Open the download_sra_urls.txt file

Add wget -c in front of each URL
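If the URL list is long, the prefix can be added with a few lines of Python instead of by hand. A minimal sketch: the function name `add_wget_prefix` is hypothetical, and the example URL is the one that appears in the download log of this lecture.

```python
def add_wget_prefix(lines):
    """Turn a list of URL lines into resumable wget commands."""
    commands = []
    for line in lines:
        url = line.strip()
        if not url:  # skip blank lines
            continue
        commands.append("wget -c " + url)
    return commands


urls = [
    "ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR128/SRR1286228/SRR1286228.sra\n",
]
for command in add_wget_prefix(urls):
    print(command)
```

In practice the lines would come from `open("download_sra_urls.txt")` and the result would be written back to the same file before running it with bash.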

bash download_sra_urls.txt
--2015-11-16 13:28:38-- ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR128/SRR1286228/SRR1286228.sra
 => ‘SRR1286228.sra’
Resolving ftp-trace.ncbi.nlm.nih.gov (ftp-trace.ncbi.nlm.nih.gov)... 130.14.250.7, 2607:f220:41e:250::13
Connecting to ftp-trace.ncbi.nlm.nih.gov (ftp-trace.ncbi.nlm.nih.gov)|130.14.250.7|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD (1) /sra/sra-instant/reads/ByRun/sra/SRR/SRR128/SRR1286228 ... done.
==> SIZE SRR1286228.sra ... 3488089695
==> PASV ... done. ==> RETR SRR1286228.sra ... done.
Length: 3488089695 (3.2G) (unauthoritative)

SRR1286228.sra 0%[ ] 67.01K 48.8KB/s

Download SRA

Convert them to FASTQ

fastq-dump --split-files --gzip SRR1286228.sra

Run kallisto

kallisto quant -t 4 -b 100 -o SRR1171560 -i mouse.idx SRR1171560_1.fastq.gz SRR1171560_2.fastq.gz

-o SRR1171560: directory where the output is saved

-i mouse.idx: generated index

SRR1171560_1.fastq.gz SRR1171560_2.fastq.gz: FASTQ files (two files: paired-end)

[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31
[index] number of targets: 88,198
[index] number of k-mers: 82,099,631
[index] number of equivalence classes: 297,305
[quant] running in paired-end mode
[quant] will process pair 1: SRR1171560_1.fastq.gz
                             SRR1171560_2.fastq.gz
[quant] finding pseudoalignments for the reads ...

Depending on the size of the transcriptome and the number of reads, it takes 5-10 min on an ordinary PC

Transcript abundance

It is a tab-separated text file. Like other bioinformatics data, you can read it in the IPython Notebook

Analyze transcript abundance with IPython and Pandas

Columns: transcript name (Ensembl ID), read count for the transcript, tpm

TPM: transcripts per million
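As a worked example of what TPM means, the usual definition divides each read count by the transcript length, then rescales so that all values sum to one million. The sketch below uses made-up counts and lengths; the function name `tpm` is hypothetical, and kallisto itself works with effective lengths and estimated counts rather than these raw numbers.

```python
def tpm(counts, lengths):
    """Transcripts per million from read counts and transcript lengths."""
    # Reads per base for each transcript.
    rates = [count / length for count, length in zip(counts, lengths)]
    total = sum(rates)
    # Rescale so that the values add up to one million.
    return [rate / total * 1e6 for rate in rates]


# Three toy transcripts: the third has the most reads but is three times longer.
toy_counts = [100, 200, 300]
toy_lengths = [1000, 1000, 3000]
values = tpm(toy_counts, toy_lengths)
print([round(value) for value in values])
```

The third transcript ends up with the same TPM as the first, because its 300 reads are spread over a transcript three times as long.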

Sort based on the tpm

Highly expressed

Lowly expressed

Without transcript annotation, the table is difficult to interpret.
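Reading and sorting the abundance table could look like the sketch below. The column names are the ones kallisto writes to abundance.tsv; the rows are made-up toy values with hypothetical transcript names, read from a string so the example runs without any files.

```python
import io

import pandas as pd

# Toy stand-in for the abundance.tsv file in the kallisto output directory.
toy_tsv = (
    "target_id\tlength\teff_length\test_counts\ttpm\n"
    "tx_A\t1000\t850.0\t10.0\t2.5\n"
    "tx_B\t2000\t1850.0\t900.0\t120.0\n"
    "tx_C\t1500\t1350.0\t0.0\t0.0\n"
)

abundance = pd.read_csv(io.StringIO(toy_tsv), sep="\t")

# Highest TPM first.
sorted_abundance = abundance.sort_values("tpm", ascending=False)
print(sorted_abundance)
```

With real data, replace `io.StringIO(toy_tsv)` with the path to the file, for example `"SRR1171560/abundance.tsv"`.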

Transcript annotation data

Transcript IDs are given as Ensembl IDs

Go to http://asia.ensembl.org/index.html?redirect=no

Select ‘BioMart’

Select ‘Ensembl Genes 82’

Select Mouse

Then click ‘Attributes’

Add the information you want to see

Add more information

Press ‘Go’

The file will be downloaded and saved as mart_export.txt

mart_export.txt

Copy it to the directory where the IPython Notebook is

Read mart_export.txt into the annotation DataFrame

Now we have two DataFrames to connect

Merge the annotation DataFrame into abundance based on transcript ID

Some entries are ‘NaN’ (missing values). Fill them with blanks

Save the result as ‘abundance_plus’.
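The merge and fill steps can be sketched as follows. The BioMart column names (`Ensembl Transcript ID`, `Associated Gene Name`, `Description`) are assumptions about what the export contains, since they depend on the attributes you selected; the rows are made-up toy data.

```python
import pandas as pd

# Toy abundance table (kallisto output) and toy annotation table (BioMart export).
abundance = pd.DataFrame({
    "target_id": ["tx_A", "tx_B", "tx_C"],
    "tpm": [2.5, 120.0, 0.0],
})
annotation = pd.DataFrame({
    "Ensembl Transcript ID": ["tx_A", "tx_B"],
    "Associated Gene Name": ["GeneA", "GeneB"],
    "Description": ["toy description A", None],
})

# Left merge keeps every transcript in abundance, annotated or not.
abundance_plus = abundance.merge(
    annotation,
    how="left",
    left_on="target_id",
    right_on="Ensembl Transcript ID",
)

# Replace missing annotation values with empty strings.
annotation_columns = ["Ensembl Transcript ID", "Associated Gene Name", "Description"]
abundance_plus[annotation_columns] = abundance_plus[annotation_columns].fillna("")
print(abundance_plus)
```

A left merge is used so that transcripts without annotation are kept with blank fields rather than dropped.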

Some counting: number of transcripts with TPM > 1 and TPM > 10
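Counting transcripts above a threshold is a boolean comparison followed by a sum. A small sketch on made-up values (the transcript names are hypothetical):

```python
import pandas as pd

abundance = pd.DataFrame({
    "target_id": ["tx_A", "tx_B", "tx_C", "tx_D", "tx_E"],
    "tpm": [0.0, 0.5, 2.5, 12.0, 120.0],
})

# Each comparison gives True/False per row; summing counts the True values.
n_over_1 = int((abundance["tpm"] > 1).sum())
n_over_10 = int((abundance["tpm"] > 10).sum())
print(n_over_1, n_over_10)
```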

Sort based on TPM

Same as before, but this time we have gene names and descriptions

The most abundant transcripts in your sample

What if we want to find all transcripts involved in a specific biological process?

GO : Gene Ontology

A keyword and classification system for biological entities (proteins, genes, transcripts)

http://geneontology.org

http://www.ebi.ac.uk/QuickGO/GProtein?ac=O88569

Many keywords are associated with a specific protein

Process

1. Retrieve all transcripts and their corresponding GO term associations

2. Search using the GO term name

3. Find the list of genes annotated with specific GO terms

4. Find those genes in the transcript abundance table

GO Term – Transcript associations

Go back to Ensembl BioMart, select Genes and Mus musculus

Results and export data

Press ‘Go’ and download the file

Open mart_export.txt

Rename it to GO.txt and copy it to the working directory where the IPython Notebook is

Many terms are associated with one gene or transcript

Read GO-transcript associations

Transcripts associated with the GO term name ‘cell cycle’

Search for rows whose ‘GO Term Name’ contains ‘cell cycle’

Find the unique transcript IDs

Save them as cellcycleDF
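The search and de-duplication could be sketched like this. The column names `Ensembl Transcript ID` and `GO Term Name` are assumptions about the BioMart export, and the rows are made-up toy data.

```python
import pandas as pd

# Toy stand-in for GO.txt: one row per (transcript, GO term) pair.
go = pd.DataFrame({
    "Ensembl Transcript ID": ["tx_A", "tx_A", "tx_B", "tx_C"],
    "GO Term Name": ["cell cycle", "mitotic cell cycle", "actin binding", None],
})

# Rows whose term name contains the phrase; missing names count as no match.
mask = go["GO Term Name"].str.contains("cell cycle", case=False, na=False, regex=False)

# One transcript can match several terms, so keep each ID only once.
cellcycleDF = pd.DataFrame({
    "target_id": go.loc[mask, "Ensembl Transcript ID"].unique(),
})
print(cellcycleDF)
```

The ID column is named `target_id` here so that it matches the abundance table in the later join.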

cellcycleDF: transcripts associated with the GO term ‘cell cycle’

abundance_plus: whole transcriptome

Join based on target_id: transcripts associated with ‘cell cycle’

Only data common to both are included
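A join that keeps only the common rows is an inner merge in Pandas. A minimal sketch with made-up toy tables (the transcript and gene names are hypothetical):

```python
import pandas as pd

abundance_plus = pd.DataFrame({
    "target_id": ["tx_A", "tx_B", "tx_C"],
    "tpm": [2.5, 120.0, 0.0],
    "Associated Gene Name": ["GeneA", "GeneB", "GeneC"],
})
cellcycleDF = pd.DataFrame({"target_id": ["tx_B", "tx_D"]})

# how="inner" keeps only the target_id values present in both tables.
cellcycle_abundance = abundance_plus.merge(cellcycleDF, on="target_id", how="inner")
print(cellcycle_abundance)
```

Here tx_D is in the GO list but not in the abundance table, and tx_A and tx_C are not in the GO list, so only tx_B remains.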

Using two GO term names

“I want to search for transcripts associated with ‘cell cycle’ and ‘actin’”

First, generate a DataFrame containing transcripts associated with ‘actin’

cellcycleDF and actinDF

205 Transcripts
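Combining two GO keywords amounts to building one DataFrame per keyword and taking their intersection. In this sketch the helper `transcripts_with_term` is hypothetical, the column names are assumptions about the BioMart export, and the data are made up.

```python
import pandas as pd


def transcripts_with_term(go, keyword):
    """Unique transcript IDs whose GO term name contains the keyword."""
    mask = go["GO Term Name"].str.contains(keyword, case=False, na=False, regex=False)
    return pd.DataFrame({"target_id": go.loc[mask, "Ensembl Transcript ID"].unique()})


go = pd.DataFrame({
    "Ensembl Transcript ID": ["tx_A", "tx_A", "tx_B", "tx_C"],
    "GO Term Name": [
        "cell cycle",
        "actin binding",
        "actin filament organization",
        "cell cycle",
    ],
})

cellcycleDF = transcripts_with_term(go, "cell cycle")
actinDF = transcripts_with_term(go, "actin")

# Inner merge keeps transcripts present in both lists.
both = cellcycleDF.merge(actinDF, on="target_id", how="inner")
print(both)
```

Note that this is a plain substring search, so ‘actin’ would also match any term name that merely contains those letters.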

Differential Gene Expression (DGE)

Observing one transcriptome is informative.

But comparing two or more transcriptomes is more informative.

Different stages

WT vs. mutant? Different treatments

You may think like this..

Quantification of each sample

Sample A Sample B

Just find the list of genes that are higher in Sample A. Simple!

Not really…

Two factors

Repeat

Although RNA-Seq contains a great deal of information, a single RNA-Seq run is just ONE experiment. You need to repeat it and show statistical significance between conditions!

Multiple Comparison

“OK. We repeated treated and control three times each. Compare the TPM of each gene, do a statistical test for each gene separately, and if p<0.05, it is significantly different”

-> Not good

“If you compare many things simultaneously, something is bound to look different by chance”

“If you make many comparisons, you should make the significance threshold more stringent”
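To see why per-gene p<0.05 is not enough: with the 88,198 targets in the mouse index, testing each one at 0.05 would flag about 4,400 transcripts even if nothing were truly different. One common way to adjust for this is the Benjamini-Hochberg procedure, sketched below in plain Python. The function name and the p-values are made up for illustration; this is not presented as the exact method any particular tool uses.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values never increase
    # as the raw p-values get smaller.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Expected number of false positives if no transcript is truly different.
expected_false_positives = 88198 * 0.05
print(round(expected_false_positives))

toy_pvalues = [0.01, 0.04, 0.03, 0.005]
toy_adjusted = benjamini_hochberg(toy_pvalues)
print([round(q, 3) for q in toy_adjusted])
```

Each adjusted value is at least as large as the raw p-value, so fewer tests pass the same 0.05 cutoff.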

Inference of differential expression is not trivial

For kallisto-generated quantifications, use sleuth

http://pachterlab.github.io/sleuth/

Installation of R and RStudio

First, install R https://www.r-project.org

RStudio is a development environment for R

https://www.rstudio.com

Launch RStudio

Install sleuth and its dependencies

source("http://bioconductor.org/biocLite.R")

biocLite("rhdf5")

install.packages("devtools")

devtools::install_github("pachterlab/sleuth")

Differential Expression Datasets

Three datasets for mouse oocytes (SRR1286228, SRR1286230, SRR1286231)

http://sra.dnanexus.com/studies/SRP009468/runs

Three datasets for two-cell mouse embryos (SRR385622, SRR385623, SRR385624)

Download and convert to FASTQ

Convert to FASTQ and run kallisto
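With six runs, it is convenient to generate the commands in a loop rather than typing them one by one. The sketch below only builds and prints the command lines, using the same options shown earlier in this lecture; the function names are hypothetical, and the index name `mouse.idx` and the `_1`/`_2` file naming are assumed to match your own files.

```python
def fastq_dump_command(run):
    """Command that converts one .sra file to gzipped paired FASTQ files."""
    return ["fastq-dump", "--split-files", "--gzip", run + ".sra"]


def kallisto_quant_command(run, index="mouse.idx", threads=4, bootstraps=100):
    """Command that quantifies one paired-end run into a directory named after it."""
    return [
        "kallisto", "quant",
        "-t", str(threads),
        "-b", str(bootstraps),
        "-o", run,
        "-i", index,
        run + "_1.fastq.gz",
        run + "_2.fastq.gz",
    ]


runs = [
    "SRR1286228", "SRR1286230", "SRR1286231",  # oocytes
    "SRR385622", "SRR385623", "SRR385624",     # two-cell embryos
]

commands = []
for run in runs:
    commands.append(fastq_dump_command(run))
    commands.append(kallisto_quant_command(run))

for command in commands:
    print(" ".join(command))
```

To actually run them from Python, each list can be passed to `subprocess.run(command, check=True)`, provided fastq-dump and kallisto are on your PATH.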

Then make a text file describing the samples

MII oocyte datasets (3 sets)

2-cell datasets (3 sets)

Organize the kallisto output directories like this

Save the sample description as study_design.txt
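The sample description is a small table with one row per run. The sketch below builds such a table as tab-separated text. The column names `sample` and `condition` and the condition labels are assumptions for illustration; check what the analysis script you are using actually expects and adjust them.

```python
import csv
import io

oocyte_runs = ["SRR1286228", "SRR1286230", "SRR1286231"]
two_cell_runs = ["SRR385622", "SRR385623", "SRR385624"]


def study_design_text(groups):
    """Build a tab-separated sample table from (condition, runs) pairs."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["sample", "condition"])  # assumed header names
    for condition, runs in groups:
        for run in runs:
            writer.writerow([run, condition])
    return buffer.getvalue()


design = study_design_text([
    ("MII_oocyte", oocyte_runs),
    ("2cell", two_cell_runs),
])
print(design)
```

To save it, write the string to a file, for example with `open("study_design.txt", "w").write(design)`.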

Download this script and save it in the working directory

https://gist.github.com/madscientist01/a49574b7fba18e65818a

Change this line to your working directory

study_design.txt should be in the same directory

Open the Anal.R file

If the analysis finishes without problems…

Quality Check

Variations between repeats?

Much better correlation between repeats

Higher expression

Lower expression

High Fold Change

No Change

Differentially expressed

Not Differentially expressed

q-value: false discovery rate

Log Fold change

Search by gene name

Gene Level Expressions

One isoform

Second Isoform has much higher Expression levels.

Abnormality?

Download the table and you can analyze it in Pandas.
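Once the table is downloaded, filtering it in Pandas could look like the sketch below. The column names `target_id`, `qval` and `b` are assumptions about the exported table, so check the header of your own file; the rows are made-up toy values read from a string.

```python
import io

import pandas as pd

# Toy stand-in for the downloaded results table.
toy_csv = (
    "target_id,qval,b\n"
    "tx_A,0.001,3.2\n"
    "tx_B,0.300,0.1\n"
    "tx_C,0.020,-2.5\n"
)

results = pd.read_csv(io.StringIO(toy_csv))

# Keep transcripts below a 5% false discovery rate, most significant first.
significant = results[results["qval"] < 0.05].sort_values("qval")
print(significant)
```

The filtered table can then be merged with the annotation DataFrame on the transcript ID, in the same way as the abundance table.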

To be continued…

Assignments

• Install kallisto and sleuth (R and RStudio)
• Download the SRA dataset SRR385622
• Run kallisto on the sample
