
Computer Skills for Biological Research, Lecture 8


Computational Skill for Modern Biology Research

Department of Biology, Chungbuk National University

8th Lecture 2015.11.3

NGS Analysis I: Aligning NGS Reads to a Reference Genome


Syllabus

Week  Topic

Week 1  Introduction: Why do we need to learn this stuff?

Week 2  Basics of Unix and running BLAST on your PC

Week 3  Unix Command Prompt II and shell scripts

Week 4  Basics of programming (Python programming)

Week 5  Python Scripting II and sequence manipulations

Week 6  IPython Notebook and Pandas

Week 7  Basics of Next Generation Sequencing and Tutorial

Weeks 8-9  Next Generation Sequencing Analysis I

Week 10  Next Generation Sequencing Analysis II

Week 11  R and statistical analysis

Week 12  Bioconductor I

Week 13  Bioconductor II

Week 14  Network analysis


What we can do with NGS data

NGS sequencing data: is there a reference sequence for your favorite organism?

Yes: Resequencing

- Alignment with the reference genome
- Output: variants (SNPs, structural variations)
- Association study with phenotypes

No: De novo genome sequencing

- Sequence assembly
- Output: sequence contigs
- Gene predictions, functional classifications…


Resequencing

Reference sequences: well-established genome sequences

We are interested in understanding genome level differences

Snyder M et al. Genes Dev. 2010;24:423-431

SNP/Indel

Phased SNP

Deletion

Insertion

Inversion


ACGTTTGGATACTGCAAACCTATG

ACGTTTGTATACTGCAAACATATG

SNP (Single Nucleotide Polymorphisms)

• A change in a single nucleotide of the sequence

• Compared with the human reference sequence, an individual human has 3-4 million SNPs

• Some of them are very frequent, while others are very rare

- Common variant (20-40% frequency in populations)
- Rare variant (less than 1%)
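
As a minimal sketch (not part of the original slides), the two example sequences on this slide can be compared position by position in Python to locate the single-nucleotide differences. The function name `find_mismatches` is hypothetical. The sketch assumes the two sequences are already aligned and of equal length, and treating the first one as the reference is also an assumption.

```python
def find_mismatches(reference, sample):
    """Return (1-based position, reference base, sample base) for each mismatch."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (pos, ref_base, alt_base)
        for pos, (ref_base, alt_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != alt_base
    ]


reference = "ACGTTTGGATACTGCAAACCTATG"  # first sequence on the slide
sample = "ACGTTTGTATACTGCAAACATATG"     # second sequence on the slide
snps = find_mismatches(reference, sample)
for pos, ref_base, alt_base in snps:
    print(f"position {pos}: {ref_base} -> {alt_base}")
```

Real variant callers work on millions of aligned reads rather than on two clean strings, so this only illustrates what "a change in a single nucleotide" means.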


SNPs vs. SNVs

Both are single-nucleotide variants

• SNP

– Known variant in the species (well characterized)
– Known variants exist at a specific frequency in populations
– Verified in a population
– Registered in dbSNP (http://www.ncbi.nlm.nih.gov/snp)

• SNV

– Specific variant found in a specific person (not well characterized)
– Very low frequency

Really a matter of frequency of occurrence

http://ccsb.stanford.edu/education/Nair_NGS.pptx

Single Nucleotide Variants


TGCAAACCTATG

Indel (Insertion/Deletion)

• Deletion or addition of bases (less than 1 kb)

- 300,000-600,000 indels per person

• Large-scale structural variation (more than 2 kbp)

- more than 1,000 per person

TGCAAAC-TATG
TGCAAACC-TATG
TGCAAACCCTATG


Today, we will learn how to find these variants from NGS data

- Reference genome sequences (FASTA format)
- Sequence data (FASTQ format)

Software

- bwa, samtools, bcftools

• Most of the software is Unix-based
• For large eukaryotic genomes, it is difficult to run on an ordinary PC
• But for small eukaryotes or bacteria, it should be OK…

Workflow

Sequencing data (FASTQ) + reference genome sequence -> Mapping -> Alignment file (SAM format)


Some information about NGS data

Single Read (SR) or Paired End (PE)

Read Length

Depth of Coverage (DNA)
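
As a rough illustration of how these three pieces of information relate (not from the original slides), the expected average depth of coverage is commonly estimated as the number of reads times the read length, divided by the genome size. The numbers below are made up and the function name `expected_coverage` is hypothetical.

```python
def expected_coverage(num_reads, read_length, genome_size):
    """Average depth of coverage = total sequenced bases / genome size."""
    return num_reads * read_length / genome_size


# Made-up example values: 1.2 million reads of 100 bp on a 12 Mbp genome.
coverage = expected_coverage(num_reads=1_200_000, read_length=100, genome_size=12_000_000)
print(f"{coverage:.1f}x")
```

For paired-end data, each read of a pair counts separately in `num_reads`.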


SRA

http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=announcement

Sequence Read Archive : Repository for NGS Data


http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?study=SRP018525

In this study, they performed RNA sequencing 47 times (160.8 Gbp)


Sources, Accessions, Type of Experiments


Install SRA Toolkit

To download NGS data archived in NCBI/SRA, you need to install the SRA Toolkit

http://www.ncbi.nlm.nih.gov/Traces/sra/?view=software


tar -xvzf sratoolkit.2.5.4-mac64.tar.gz

Extract archive

cd bin
pwd

(In the case of mac)


Set up sratoolkit in your PATH

Add these lines to the .bash_profile in your home directory

Set up the path


Download sra file

Let's download a data file (it is BIG)

prefetch ERR560539

(ERR560539 is the SRA ID)

Maximum file size download limit is 20,971,520KB

2015-11-02T01:09:26 prefetch.2.5.4: 1) Downloading 'ERR560539'...
2015-11-02T01:09:26 prefetch.2.5.4: Downloading via http...
2015-11-02T01:23:08 prefetch.2.5.4: 1) 'SRR032988' was downloaded successfully

The file will be saved in ~/ncbi/public/sra


Convert the sra file into FASTQ files

fastq-dump --split-files ERR560539

(ERR560539 is the SRA ID)

Read 1887328 spots for ERR560539
Written 1887328 spots for ERR560539

ls
ERR560539_1.fastq ERR560539_2.fastq

Paired-end reads

[Figure: a read pair, with the forward and reverse reads each labeled 5' to 3']


Look at the end of the FASTQ file

Quality

Sequence

Size of file

About 2.9Gb
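
To make the "Sequence" and "Quality" lines more concrete, here is a small sketch (not from the original slides) that parses one four-line FASTQ record and converts the quality characters into numeric scores. The record is made up, the function name `parse_fastq_record` is hypothetical, and the Phred+33 quality encoding is an assumption about the file.

```python
def parse_fastq_record(lines):
    """Split the four lines of one FASTQ record into name, sequence and quality scores."""
    header, sequence, separator, quality = [line.rstrip("\n") for line in lines]
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality must have the same length")
    # Assumption: Phred+33 encoding (score = ASCII code of the character - 33).
    scores = [ord(char) - 33 for char in quality]
    return header[1:], sequence, scores


# Made-up record for illustration; not taken from ERR560539.
record = ["@read1\n", "ACGTN\n", "+\n", "II5#!\n"]
name, sequence, scores = parse_fastq_record(record)
print(name, sequence, scores)
```

A higher score means a more reliable base call; FastQC summarizes these scores over the whole file.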


Quality control of FASTQ files using FastQC

Download and install FastQC

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

Open Fastq file


Install bwa, samtools, bcftools

bwa: aligns short Illumina reads to a reference genome sequence

Genome sequence

Sequencing data

Finds matches and aligns the sequences

samtools: converts data formats and finds variants in concert with bcftools


Install bwa, samtools, bcftools

1. Download the source files and compile them following the instructions

https://github.com/lh3/bwa/
https://github.com/samtools/samtools/
https://github.com/samtools/bcftools

http://gatkforums.broadinstitute.org/discussion/2899/howto-install-all-software-packages-required-to-follow-the-gatk-best-practices

2. Install via Homebrew (Mac) or apt-get (Ubuntu Linux)

Install Homebrew

ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

brew tap homebrew/science

http://alejandrosoto.net/blog/2014/01/22/setting-up-my-mac-for-scientific-research/

brew install bwa
brew install samtools
brew install bcftools


samtools


bwa


bcftools


What we will do

Align sequencing reads to the reference genome

First, we will download the reference genome

https://support.illumina.com/sequencing/sequencing_software/igenome.html

We will use the Saccharomyces cerevisiae genome (sacCer3)

Download the genome file into the current directory


Extract the reference genome

tar -xvzf Saccharomyces_cerevisiae_UCSC_sacCer3.tar.gz
cp ./Saccharomyces_cerevisiae/UCSC/sacCer3/Sequence/BWAIndex/genome.fa .
mv genome.fa yeast

First, you need to generate an index file for the genome sequence

bwa index yeast

You can think of the index as something like an address book for the genome, for fast access.
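
The address-book idea can be sketched with a toy k-mer lookup table. This is only an illustration of why an index speeds up lookups: bwa itself builds a different, more compact index based on the Burrows-Wheeler transform, and the names `build_kmer_index` and `k` here are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(genome, k):
    """Map every k-mer in the genome to the list of 0-based positions where it starts."""
    index = defaultdict(list)
    for start in range(len(genome) - k + 1):
        index[genome[start:start + k]].append(start)
    return index


genome = "ACGTTTGGATACTGCAAACCTATG"  # toy "genome" for illustration
index = build_kmer_index(genome, k=4)

# Looking up the first k-mer of a read gives candidate positions
# without scanning the whole genome.
read = "CTGCAAAC"
candidates = index.get(read[:4], [])
hits = [pos for pos in candidates if genome[pos:pos + len(read)] == read]
print(hits)
```

The index is built once and reused for every read, which is why `bwa index` is run before `bwa mem`.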


Then, download the NGS sequence data to analyze

Saccharomyces cerevisiae isolated from wine

Download ERR560539.sra

prefetch ERR560539
fastq-dump --split-files ERR560539


Running bwa (align NGS reads to the reference)

bwa mem -t 4 yeast ERR560539_1.fastq ERR560539_2.fastq > ERR560539.sam

mem: alignment method (select this if the NGS reads are longer than 50 bp)

-t: number of threads (if the CPU of your computer or server has 4 cores, use -t 4)

ERR560539_1.fastq, ERR560539_2.fastq: the two FASTQ files containing the NGS reads

> ERR560539.sam: the output is saved as ERR560539.sam

For the yeast alignment, it takes 259.789 sec on a 4-core computer


SAM file

Records the location of each read in the reference

Each line includes the read name and the starting position
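
A short sketch (not from the original slides) of how one alignment line of a SAM file can be split into its 11 mandatory tab-separated fields. The alignment line is made up for illustration and is not real bwa output. The function name `parse_sam_line` is hypothetical.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]


def parse_sam_line(line):
    """Return the 11 mandatory SAM fields of one alignment line as a dictionary."""
    values = line.rstrip("\n").split("\t")[:11]
    record = dict(zip(SAM_FIELDS, values))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    return record


# Made-up alignment line for illustration; not real output from bwa.
line = "read1\t99\tchrI\t1500\t60\t8M\t=\t1700\t208\tCTGCAAAC\tIIIIIIII\n"
record = parse_sam_line(line)
# QNAME is the read name, RNAME the chromosome, POS the 1-based starting position.
print(record["QNAME"], record["RNAME"], record["POS"])
```

Header lines in a SAM file start with `@` and would need to be skipped before calling this function.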


Convert the SAM file to a BAM file and index it

Convert SAM to BAM (binary SAM file)

samtools view -b -@ 4 ERR560539.sam > ERR560539.bam

-b: output a BAM file
-@ 4: use 4 threads (for a 4-core CPU)

Sort the BAM file

samtools sort -@ 4 ERR560539.bam ERR560539.sorted

-@ 4: use 4 threads (for a 4-core CPU)

(samtools 1.3 and later: samtools sort -@ 4 -o ERR560539.sorted.bam ERR560539.bam)

Generate the index file

samtools index 941832.sorted.bam

941832.bam
941832.sam
941832.sorted.bam
941832.sorted.bam.bai

Now what?


Let's visualize the data: Integrative Genomics Viewer (IGV)

https://www.broadinstitute.org/igv/download


https://www.broadinstitute.org/software/igv/download


In our examples, select sacCer3


Zoom in / Zoom out

Select chromosome

Locations


Load bam file

File->Load from file-> Select yeast.sorted.bam


SNP

Gene

Zoom it


Reference: C, Sequenced: T


Missing in Sequenced Genome?

Low sequencing Depth


Find out Variants

samtools mpileup -g -f yeast yeast.sorted.bam > yeast.bcf

Examine every position in the genome and check the alignment

Find out the likelihood of an alternative allele
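
The idea of examining every position can be sketched with a toy pileup that counts which bases the aligned reads show at each reference position. This is a strong simplification, not how samtools is implemented: the real tool also weighs base and mapping qualities to compute genotype likelihoods. The reads and the reference below are made up, and the function name `pileup` is hypothetical.

```python
from collections import Counter, defaultdict


def pileup(reads):
    """Count the bases observed at each 0-based reference position.

    `reads` is a list of (start, sequence) pairs for gapless alignments.
    """
    columns = defaultdict(Counter)
    for start, sequence in reads:
        for offset, base in enumerate(sequence):
            columns[start + offset][base] += 1
    return columns


reference = "ACGTTTGGATAC"
# Made-up reads; three of the four reads covering position 7 carry T instead of G.
reads = [(0, "ACGTTTGT"), (2, "GTTTGTAT"), (4, "TTGGATAC"), (5, "TGTATAC")]

columns = pileup(reads)
candidates = []
for position in sorted(columns):
    ref_base = reference[position]
    alt_counts = {b: n for b, n in columns[position].items() if b != ref_base}
    if alt_counts:
        candidates.append((position, ref_base, alt_counts))
print(candidates)
```

Positions where many reads disagree with the reference are the candidates that the calling step then evaluates.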

bcftools call -c -v yeast.bcf > yeast.vcf

Write the variants to yeast.vcf

Open yeast.vcf in the nano editor


Header

Variants


DP: raw read depth

How many sequence reads support this variant?

DP4: number of high-quality ref-forward, ref-reverse, alt-forward and alt-reverse bases
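
As a small sketch (not from the original slides), the four DP4 numbers can be turned into the fraction of high-quality bases that support the alternative allele. The DP4 value below is made up and the function name `alt_fraction_from_dp4` is hypothetical.

```python
def alt_fraction_from_dp4(dp4):
    """Fraction of high-quality bases supporting the alternative allele.

    `dp4` is the DP4 value as text: ref-forward, ref-reverse, alt-forward, alt-reverse.
    """
    ref_fwd, ref_rev, alt_fwd, alt_rev = (int(x) for x in dp4.split(","))
    total = ref_fwd + ref_rev + alt_fwd + alt_rev
    if total == 0:
        return 0.0
    return (alt_fwd + alt_rev) / total


# Made-up DP4 value: 2 reference-supporting and 18 alternative-supporting bases.
fraction = alt_fraction_from_dp4("1,1,10,8")
print(fraction)
```

Comparing the forward and reverse counts of the alternative allele also gives a first impression of strand balance.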


Visualize

‘Load from Files’ in IGV

Select VCF file (yeast.vcf)


SNV


Data mining from variant data

A VCF file is just a text file, so we can handle it with Unix utilities and Pandas

head -n 50 yeast.vcf

Print the first 50 lines of the yeast.vcf file

Header lines start with '##'

We want to remove the lines that start with ##. How?


grep -v "^##" yeast.vcf > yeast2.vcf

Display the lines except those starting with '##', and save them as yeast2.vcf

We can do primitive filtering using grep

grep 'chrIX' yeast2.vcf | wc -l
2818

Display the variants on chrIX and count the lines
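
For comparison, the same two grep steps can be sketched in plain Python. The miniature VCF text below is made up for illustration (it is not taken from yeast.vcf), so the counts differ from the 2818 shown above.

```python
# Made-up miniature VCF text for illustration; not taken from yeast.vcf.
vcf_text = """##fileformat=VCFv4.2
##contig=<ID=chrI>
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
chrI\t100\t.\tC\tT\t50\t.\tDP=81;MQ=60
chrIX\t200\t.\tG\tA\t30\t.\tDP=14;MQ=20
chrIX\t350\t.\tA\tG\t45\t.\tDP=60;MQ=55
"""

lines = vcf_text.splitlines()
# Same effect as: grep -v "^##"
body = [line for line in lines if not line.startswith("##")]
# Same effect as: grep 'chrIX' | wc -l
chr9_count = sum(1 for line in body if "chrIX" in line)
print(len(body), chr9_count)
```

Like grep, the `"chrIX" in line` test matches the text anywhere in the line, which is why this kind of filtering is only a primitive first step.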

Filtering with Pandas


Data mining using IPython Notebook


The information is stored as DP=14;VDB=2.6447e-06… We want to convert it into columns of the DataFrame. How?


Define functions

DP=81;VDB=5.92922e-11;SGB=-0.693147;MQSB=1;MQ0…

Convert the string into a dictionary

{'DP':81, 'VDB':5.92922e-11, 'SGB':-0.693147…}
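
The function on the original slide is only shown as a screenshot, so the following is a hypothetical stand-in rather than the lecturer's code: a minimal sketch of one way to convert such a string into a dictionary. The name `parse_info` and the handling of flags and non-numeric values are assumptions.

```python
def parse_info(info):
    """Convert a VCF INFO string such as 'DP=81;MQ=60' into a dictionary."""
    result = {}
    for item in info.split(";"):
        if "=" not in item:
            result[item] = True  # flag entries carry no value
            continue
        key, value = item.split("=", 1)
        try:
            result[key] = int(value)
        except ValueError:
            try:
                result[key] = float(value)
            except ValueError:
                result[key] = value  # keep text such as comma-separated lists
    return result


info = parse_info("DP=81;VDB=5.92922e-11;SGB=-0.693147;MQSB=1")
print(info)
```

Values that are neither integers nor floats (for example comma-separated lists) are kept as text so that no information is lost.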


View a single column as a Series


Apply the 'split' function to each row

Convert the result to a list


Generate a DataFrame from the list


Save it as a new DataFrame named info

Select two columns of info (DP, MQ) and add them to vcf
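
The notebook cells for these steps are only available as screenshots, so here is a hedged sketch of how they might look with Pandas. The miniature table is made up (the real one would be read from yeast2.vcf), and the helper `info_to_dict` is hypothetical.

```python
import pandas as pd

# Made-up miniature variant table; the real one would be read from yeast2.vcf.
vcf = pd.DataFrame({
    "#CHROM": ["chrI", "chrI", "chrIX"],
    "POS": [100, 250, 200],
    "INFO": ["DP=81;MQ=60", "DP=14;MQ=20", "DP=60;MQ=55"],
})


def info_to_dict(info, keys=("DP", "MQ")):
    """Extract the requested numeric keys from an INFO string as a dictionary."""
    result = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        if key in keys:
            result[key] = float(value)
    return result


# Apply the function to each row of the INFO column (a Series), convert the
# result to a list, and build a new DataFrame named info from that list.
info = pd.DataFrame(vcf["INFO"].apply(info_to_dict).tolist(), index=vcf.index)
# Add the DP and MQ columns of info to the variant table.
vcf = vcf.join(info[["DP", "MQ"]])
print(vcf)
```

Only DP and MQ are extracted here, which keeps the helper safe for INFO entries that are not plain numbers.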


Filter: keep variants where DP (read depth) is higher than 50 and MQ (mapping quality) is higher than 30


How many filtered SNVs are found on 'chrI'?

Unfiltered
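
A minimal Pandas sketch of the DP/MQ filter and the chrI count follows. The table is made up and assumes that DP and MQ are already available as columns, so the numbers are for illustration only.

```python
import pandas as pd

# Made-up variant table with DP and MQ already extracted as columns.
vcf = pd.DataFrame({
    "#CHROM": ["chrI", "chrI", "chrIX", "chrI"],
    "POS": [100, 250, 200, 900],
    "DP": [81, 14, 60, 75],
    "MQ": [60, 20, 55, 25],
})

# Keep rows where read depth is above 50 and mapping quality is above 30.
filtered = vcf[(vcf["DP"] > 50) & (vcf["MQ"] > 30)]

# Count the variants on chrI before and after filtering.
unfiltered_chr1 = int((vcf["#CHROM"] == "chrI").sum())
filtered_chr1 = int((filtered["#CHROM"] == "chrI").sum())
print(unfiltered_chr1, filtered_chr1)
```

The parentheses around each condition are needed because `&` binds more tightly than the comparison operators in Python.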


Save back the filtered VCF data…

Save as vcf3.vcf

grep "^##" yeast.vcf > header.vcf

Extract the header region of the VCF

cat header.vcf vcf3.vcf > filtered.vcf

Attach the header back

Open it in IGV and compare the original variant calls with the filtered ones.


Filtered

Original


Common Question Examples

1. Find all SNVs located in exons

2. Find SNVs located in the promoter regions of genes

3. Find SNVs located in specific genes of interest

4. Filter out SNVs that cause loss of function in genes

…You need another set of tools to answer these questions

We will look at them in the next lectures


SNV Filtering

Pre-processing in the mapping phase and SNV filtering help minimize false positives

• Absent in dbSNP
• Exclude LOH events
• Retain non-synonymous
• Sufficient depth of read coverage
• SNV present in given number of reads
• High mapping and SNV quality
• SNV density in a given bp window
• SNV greater than a given bp from a predicted indel
• Strand balance/bias
• Concordance across various SNV callers

http://ccsb.stanford.edu/education/Nair_NGS.pptx


Variant Annotation

• Interpretation of the variants actually found

• SeattleSeq

– annotation of known and novel SNPs
– includes dbSNP rs ID, gene names and accession numbers, SNP functions (e.g., missense), protein positions and amino-acid changes, conservation scores, HapMap frequencies, PolyPhen predictions, and clinical association

• Annovar

– Gene-based annotation
– Region-based annotations
– Filter-based annotation

http://snp.gs.washington.edu/SeattleSeqAnnotation/
http://www.openbioinformatics.org/annovar/

http://ccsb.stanford.edu/education/Nair_NGS.pptx
