32
Next Generation Sequencing for Invertebrate Virus Discovery -a practical approach Sijun Liu & Bryony C. Bonning Iowa State University, USA 8-14-2013 SIP Pittsburgh

Next generation sequencing Technology for virus · PDF fileNext Generation Sequencing for Invertebrate Virus Discovery -a practical approach Sijun Liu & Bryony C. Bonning Iowa State

Embed Size (px)

Citation preview

Next Generation Sequencing for Invertebrate Virus Discovery

-a practical approach

Sijun Liu amp Bryony C Bonning

Iowa State University USA 8-14-2013 SIP Pittsburgh

Outline

bull Introduction Why use NGS ndash Traditional approach for virus discovery ndash Next Generation Sequencing (NGS) ndash Advantages of NGS for virus discovery

bull How itrsquos done ndash Sample selection ndash Sequencing library preparation ndash Sequencing method ndash Assembly of sequencing reads ndash Identification of viral sequence ndash Assembly of viral genome

Insect viruses detecteddiscovered by use of NGS

Liu S Vijayendran D Bonning BC 2011 3(10)1849-69

Collect samples that show disease symptoms

Isolate viruses

Observe virus particles

Identify viral genomes

Clone genomic DNARNA - sequence (Sanger sequencing) - assemble viral genome

Traditional Approach for Virus Discovery

Advantages of NGS for Virus Discovery

bull Many viruses are latent or asymptomatic

bull NGS can identify viral sequences without background information on viruses

bull Viral genomes are assembled de novo without reference sequences

bull NGS has revolutionized virus discovery

sgRNA ()

4795 nt CP

pRdRP

VPg

CP

(P145)

P 10

RTD

P28 RTD () P35

UAG

Aphis glycines virus (AGV) -assembled from transcriptome

Similar to tetraviruses

Structurally resemble luteoviruses (plant virus)

A new insect virus with tetravirus-like RdRp and plant virus-like capsid protein

Outline

bull Introduction Why use NGS ndash Traditional approach for virus discovery ndash Next Generation Sequencing (NGS) ndash Advantages of NGS for virus discovery

bull How itrsquos done ndash Sample selection ndash Sequencing library preparation ndash Sequencing method ndash Assembly of sequencing reads ndash Identification of viral sequence ndash Assembly of viral genome

Sample Selection

bull Small sample size (10 ug or less RNA adequate)

-but the more the better

bull Tissue vs whole organism

-sequencing depth

bull Virus purification

-helps to identify full-length sequence

-better approach for DNA viruses

Sequencing Technologies

o Short reads (35-250 nt)

1 Genome Analyzer IIx (GAIIx) HiSeq2000 HiSeq2500 MiSeq ndash Illumina

(Hiseq2000 capable of up to 600Gb per run)

1 SOLiD 5500xl System ndash Applied Biosystems

2 HeliScopetrade Single Molecule Sequencer - Helicos

o Long reads (400-20000 nt)

1 Genome Sequencer FLX System (454) ndash Roche

2 PacBio RS - Pacific Bioscience

3 Personal Genome Machine Ion Proton - Ion Torrent

4 GridION ndash Oxford Nanopore

Preparation of Sequencing Library

Library type Viral genomes Sequence recovery

mRNA DNARNA +++ possible full-length

Small RNA DNARNA +++

DNA DNA +++ possible full-length

DNA or RNA isolated from viruses

DNARNA +++++ full-length

mRNA purification may result in loss of sequences for viruses that lack polyA tails

AGV assembled from different sequencing samples

RNA isolated from gut

RNA isolated from whole aphid with 2 rounds polyA purification

RNA isolated from whole aphid with 1 round polyA purification

Green + strand Red - strand

Assembly

DNARNA contigs

RNADNAsmall RNA Reads

Host Genome

Known viruses

New viruses in known genera

Complete viral genomes

New viruses in new genera

Nucleotide database By BlastN

Protein database By BlastX

PCR RT-PCR

NGS for Virus Discovery

Modified from Ding amp Lu 2011 Curr Opin Virol 1533-544

Assembly of Sequencing Reads -pre-processing of sequence data

bull Remove potential adaptor index sequences

bull Check sequencing quality

ndash Quality score GC content

ndash Read length distribution

ndash Overrepresented sequences

ndash etc

bull If necessary trim bases with low quality

Trimming of Bases with Low QS

Trimming of bases with low quality scores may result in loss of viral sequences

NOTE The near full-length genome of AGV was assembled from an untrimmed data set with poor quality scores The genome could not be assembled from the data set following standard trimming

Software for Checking Sequence Quality-FastQC

Sequence Count Percentage Possible Source

AGATCGGAAGAG

CACACGTCTGAAC

TCCAGTCACCTTG

TAATCTCGTATG

1968861 220

TruSeq Adapter

Index 12 (100

over 49bp)

Overrepresented Sequences

Sequence Count Percentage Possible Source

CAGATTTCGGGCTAAAGGGAATACGGTTAAAATC

CCGTGACCTGCCCTGT 51018488 4090 No Hit

TCAGATTTCGGGCTAAAGGGAATACGGTTAAAATC

CCGTGACCTGCCCTG 24264170 1945 No Hit

The seqeunces were derived from Penaeus vannamei 18S ribosomal RNA -cotaminated in sRNA

Software for manipulating sequencing data

CLC Genomics Workbench (US$5000 per copy gtUS$1000per year for update)

Assembly of Sequencing Reads

bull de novo assembly or mapping (alignment) -de novo assembly searching for new viruses no reference is needed -mapping re-sequencing SNP isolate need reference sequences (MARA GATK and other toolkits) bull de novo assembly may provide extra information about

known viral sequences Shrimp virus Infectious myonecrosis virus (IMNV a dsRNA virus) - documented seq 7560 bp (Poulos et al JGV 2006 87 987-996)

- de novo assembled from RNA-seq 8233 bp RT-PCR proved IMNV should have at least 8233 bp

Thursday 945 am 168 Virus 4 Duan Loy

Trinity for Assembly

OasesVelvet for Assembly

Running the Assembly Program

bull Two most important parameters for assembly ndash K-mers (word length) length of sequence

fragments used for joining

ndash C - coverage cut-off

bull Different combinations of K and C will result in assembly of different contigs

bull Multiple K and C should be tested for best results (Liu et al PLoS One 20127(9)e45161 doi

101371journalpone)

Multiple K Test for Assembly of AGV using OasesVelvet

(Here) read = contig Green + strand Red - strand

bull Annotation of contigs

-search for viral genes using BLASTx or BLASTn

bull BLAST against NCBI database

bull BLAST using your own databases

bull Blast2GO platform

-annotation of contigs

-motif search

-analysis of annotation data

Data Analysis How do we find viral sequences

Data Analysis Analyzing virus-derived contigs

bull Extract BLAST data (sequences with virus as top hit)

bull Organize contigs that hit the same or similar viruses

bull Join contigs into viral genome

bull Design primers for PCRRT-PCR to fill sequence gaps

bull Sequence to confirm in silico cloning result

bull 5rsquo and 3rsquo RACE to identify end sequences

Working with Viral Contigs

viral gene == virus

7815 8536

8874 9193

9476 9737

8523 8730

9495 9307

501 2315

2656 4084

2295 2679

1 244

522 340

6337 5056

5079 4486

4508 4062

6377 6759

6839 7221

7539

7840 7538 7310

24

4

34

0

63

37

6

37

7

67

59

6

83

9

72

21

7

31

0

97

37

75

38

Trinity Assembly of APV2 (gt9800 nt) Assembled using sRNA isolated from pea aphid

7539

APV2-Acyrthosiphon pisum virus 2 (dicistrovrius)

+ strand

- strand

Summary

bull No single rule can be used to find a virus by NGS

bull Knowledge of virology can greatly help for analyzing NGS data

bull Manual alignment of virus derived sequences may be needed

bull Biological evidence is required for verifying true nature of viral sequences discovered by NGS

Acknowledgements

John K VanDyk Lyric Bartholomay Duan Loy

Outline

bull Introduction Why use NGS ndash Traditional approach for virus discovery ndash Next Generation Sequencing (NGS) ndash Advantages of NGS for virus discovery

bull How itrsquos done ndash Sample selection ndash Sequencing library preparation ndash Sequencing method ndash Assembly of sequencing reads ndash Identification of viral sequence ndash Assembly of viral genome

Insect viruses detecteddiscovered by use of NGS

Liu S Vijayendran D Bonning BC 2011 3(10)1849-69

Collect samples that show disease symptoms

Isolate viruses

Observe virus particles

Identify viral genomes

Clone genomic DNARNA - sequence (Sanger sequencing) - assemble viral genome

Traditional Approach for Virus Discovery

Advantages of NGS for Virus Discovery

bull Many viruses are latent or asymptomatic

bull NGS can identify viral sequences without background information on viruses

bull Viral genomes are assembled de novo without reference sequences

bull NGS has revolutionized virus discovery

sgRNA ()

4795 nt CP

pRdRP

VPg

CP

(P145)

P 10

RTD

P28 RTD () P35

UAG

Aphis glycines virus (AGV) -assembled from transcriptome

Similar to tetraviruses

Structurally resemble luteoviruses (plant virus)

A new insect virus with tetravirus-like RdRp and plant virus-like capsid protein

Outline

bull Introduction Why use NGS ndash Traditional approach for virus discovery ndash Next Generation Sequencing (NGS) ndash Advantages of NGS for virus discovery

bull How itrsquos done ndash Sample selection ndash Sequencing library preparation ndash Sequencing method ndash Assembly of sequencing reads ndash Identification of viral sequence ndash Assembly of viral genome

Sample Selection

bull Small sample size (10 ug or less RNA adequate)

-but the more the better

bull Tissue vs whole organism

-sequencing depth

bull Virus purification

-helps to identify full-length sequence

-better approach for DNA viruses

Sequencing Technologies

o Short reads (35-250 nt)

1 Genome Analyzer IIx (GAIIx) HiSeq2000 HiSeq2500 MiSeq ndash Illumina

(Hiseq2000 capable of up to 600Gb per run)

1 SOLiD 5500xl System ndash Applied Biosystems

2 HeliScopetrade Single Molecule Sequencer - Helicos

o Long reads (400-20000 nt)

1 Genome Sequencer FLX System (454) ndash Roche

2 PacBio RS - Pacific Bioscience

3 Personal Genome Machine Ion Proton - Ion Torrent

4 GridION ndash Oxford Nanopore

Preparation of Sequencing Library

Library type Viral genomes Sequence recovery

mRNA DNARNA +++ possible full-length

Small RNA DNARNA +++

DNA DNA +++ possible full-length

DNA or RNA isolated from viruses

DNARNA +++++ full-length

mRNA purification may result in loss of sequences for viruses that lack polyA tails

AGV assembled from different sequencing samples

RNA isolated from gut

RNA isolated from whole aphid with 2 rounds polyA purification

RNA isolated from whole aphid with 1 round polyA purification

Green + strand Red - strand

Assembly

DNARNA contigs

RNADNAsmall RNA Reads

Host Genome

Known viruses

New viruses in known genera

Complete viral genomes

New viruses in new genera

Nucleotide database By BlastN

Protein database By BlastX

PCR RT-PCR

NGS for Virus Discovery

Modified from Ding amp Lu 2011 Curr Opin Virol 1533-544

Assembly of Sequencing Reads -pre-processing of sequence data

bull Remove potential adaptor index sequences

bull Check sequencing quality

ndash Quality score GC content

ndash Read length distribution

ndash Overrepresented sequences

ndash etc

bull If necessary trim bases with low quality

Trimming of Bases with Low QS

Trimming of bases with low quality scores may result in loss of viral sequences

NOTE The near full-length genome of AGV was assembled from an untrimmed data set with poor quality scores The genome could not be assembled from the data set following standard trimming

Software for Checking Sequence Quality-FastQC

Sequence Count Percentage Possible Source

AGATCGGAAGAG

CACACGTCTGAAC

TCCAGTCACCTTG

TAATCTCGTATG

1968861 220

TruSeq Adapter

Index 12 (100

over 49bp)

Overrepresented Sequences

Sequence Count Percentage Possible Source

CAGATTTCGGGCTAAAGGGAATACGGTTAAAATC

CCGTGACCTGCCCTGT 51018488 4090 No Hit

TCAGATTTCGGGCTAAAGGGAATACGGTTAAAATC

CCGTGACCTGCCCTG 24264170 1945 No Hit

The seqeunces were derived from Penaeus vannamei 18S ribosomal RNA -cotaminated in sRNA

Software for manipulating sequencing data

CLC Genomics Workbench (US$5000 per copy gtUS$1000per year for update)

Assembly of Sequencing Reads

bull de novo assembly or mapping (alignment) -de novo assembly searching for new viruses no reference is needed -mapping re-sequencing SNP isolate need reference sequences (MARA GATK and other toolkits) bull de novo assembly may provide extra information about

known viral sequences Shrimp virus Infectious myonecrosis virus (IMNV a dsRNA virus) - documented seq 7560 bp (Poulos et al JGV 2006 87 987-996)

- de novo assembled from RNA-seq 8233 bp RT-PCR proved IMNV should have at least 8233 bp

Thursday 945 am 168 Virus 4 Duan Loy

Trinity for Assembly

OasesVelvet for Assembly

Running the Assembly Program

bull Two most important parameters for assembly ndash K-mers (word length) length of sequence

fragments used for joining

ndash C - coverage cut-off

bull Different combinations of K and C will result in assembly of different contigs

bull Multiple K and C should be tested for best results (Liu et al PLoS One 20127(9)e45161 doi

101371journalpone)

Multiple K Test for Assembly of AGV using OasesVelvet

(Here) read = contig Green + strand Red - strand

bull Annotation of contigs

-search for viral genes using BLASTx or BLASTn

bull BLAST against NCBI database

bull BLAST using your own databases

bull Blast2GO platform

-annotation of contigs

-motif search

-analysis of annotation data

Data Analysis How do we find viral sequences

Data Analysis Analyzing virus-derived contigs

bull Extract BLAST data (sequences with virus as top hit)

bull Organize contigs that hit the same or similar viruses

bull Join contigs into viral genome

bull Design primers for PCRRT-PCR to fill sequence gaps

bull Sequence to confirm in silico cloning result

bull 5rsquo and 3rsquo RACE to identify end sequences

Working with Viral Contigs

viral gene == virus

7815 8536

8874 9193

9476 9737

8523 8730

9495 9307

501 2315

2656 4084

2295 2679

1 244

522 340

6337 5056

5079 4486

4508 4062

6377 6759

6839 7221

7539

7840 7538 7310

24

4

34

0

63

37

6

37

7

67

59

6

83

9

72

21

7

31

0

97

37

75

38

Trinity Assembly of APV2 (gt9800 nt) Assembled using sRNA isolated from pea aphid

7539

APV2-Acyrthosiphon pisum virus 2 (dicistrovrius)

+ strand

- strand

Summary

bull No single rule can be used to find a virus by NGS

bull Knowledge of virology can greatly help for analyzing NGS data

bull Manual alignment of virus derived sequences may be needed

bull Biological evidence is required for verifying true nature of viral sequences discovered by NGS

Acknowledgements

John K VanDyk Lyric Bartholomay Duan Loy

Insect viruses detecteddiscovered by use of NGS

Liu S Vijayendran D Bonning BC 2011 3(10)1849-69

Collect samples that show disease symptoms

Isolate viruses

Observe virus particles

Identify viral genomes

Clone genomic DNARNA - sequence (Sanger sequencing) - assemble viral genome

Traditional Approach for Virus Discovery

Advantages of NGS for Virus Discovery

bull Many viruses are latent or asymptomatic

bull NGS can identify viral sequences without background information on viruses

bull Viral genomes are assembled de novo without reference sequences

bull NGS has revolutionized virus discovery

sgRNA ()

4795 nt CP

pRdRP

VPg

CP

(P145)

P 10

RTD

P28 RTD () P35

UAG

Aphis glycines virus (AGV) -assembled from transcriptome

Similar to tetraviruses

Structurally resemble luteoviruses (plant virus)

A new insect virus with tetravirus-like RdRp and plant virus-like capsid protein

Outline

bull Introduction Why use NGS ndash Traditional approach for virus discovery ndash Next Generation Sequencing (NGS) ndash Advantages of NGS for virus discovery

bull How itrsquos done ndash Sample selection ndash Sequencing library preparation ndash Sequencing method ndash Assembly of sequencing reads ndash Identification of viral sequence ndash Assembly of viral genome

Sample Selection

bull Small sample size (10 ug or less RNA adequate)

-but the more the better

bull Tissue vs whole organism

-sequencing depth

bull Virus purification

-helps to identify full-length sequence

-better approach for DNA viruses

Sequencing Technologies

o Short reads (35-250 nt)

1 Genome Analyzer IIx (GAIIx) HiSeq2000 HiSeq2500 MiSeq ndash Illumina

(Hiseq2000 capable of up to 600Gb per run)

1 SOLiD 5500xl System ndash Applied Biosystems

2 HeliScopetrade Single Molecule Sequencer - Helicos

o Long reads (400-20000 nt)

1 Genome Sequencer FLX System (454) ndash Roche

2 PacBio RS - Pacific Bioscience

3 Personal Genome Machine Ion Proton - Ion Torrent

4 GridION ndash Oxford Nanopore

Preparation of Sequencing Library

Library type Viral genomes Sequence recovery

mRNA DNARNA +++ possible full-length

Small RNA DNARNA +++

DNA DNA +++ possible full-length

DNA or RNA isolated from viruses

DNARNA +++++ full-length

mRNA purification may result in loss of sequences for viruses that lack polyA tails

AGV assembled from different sequencing samples

RNA isolated from gut

RNA isolated from whole aphid with 2 rounds polyA purification

RNA isolated from whole aphid with 1 round polyA purification

Green + strand Red - strand

Assembly

DNARNA contigs

RNADNAsmall RNA Reads

Host Genome

Known viruses

New viruses in known genera

Complete viral genomes

New viruses in new genera

Nucleotide database By BlastN

Protein database By BlastX

PCR RT-PCR

NGS for Virus Discovery

Modified from Ding amp Lu 2011 Curr Opin Virol 1533-544

Assembly of Sequencing Reads -pre-processing of sequence data

bull Remove potential adaptor index sequences

bull Check sequencing quality

ndash Quality score GC content

ndash Read length distribution

ndash Overrepresented sequences

ndash etc

bull If necessary trim bases with low quality

Trimming of Bases with Low QS

Trimming of bases with low quality scores may result in loss of viral sequences

NOTE The near full-length genome of AGV was assembled from an untrimmed data set with poor quality scores The genome could not be assembled from the data set following standard trimming

Software for Checking Sequence Quality-FastQC

Sequence Count Percentage Possible Source

AGATCGGAAGAG

CACACGTCTGAAC

TCCAGTCACCTTG

TAATCTCGTATG

1968861 220

TruSeq Adapter

Index 12 (100

over 49bp)

Overrepresented Sequences

Sequence Count Percentage Possible Source

CAGATTTCGGGCTAAAGGGAATACGGTTAAAATC

CCGTGACCTGCCCTGT 51018488 4090 No Hit

TCAGATTTCGGGCTAAAGGGAATACGGTTAAAATC

CCGTGACCTGCCCTG 24264170 1945 No Hit

The seqeunces were derived from Penaeus vannamei 18S ribosomal RNA -cotaminated in sRNA

Software for manipulating sequencing data

CLC Genomics Workbench (US$5000 per copy gtUS$1000per year for update)

Assembly of Sequencing Reads

bull de novo assembly or mapping (alignment) -de novo assembly searching for new viruses no reference is needed -mapping re-sequencing SNP isolate need reference sequences (MARA GATK and other toolkits) bull de novo assembly may provide extra information about

known viral sequences Shrimp virus Infectious myonecrosis virus (IMNV a dsRNA virus) - documented seq 7560 bp (Poulos et al JGV 2006 87 987-996)

- de novo assembled from RNA-seq 8233 bp RT-PCR proved IMNV should have at least 8233 bp

Thursday 945 am 168 Virus 4 Duan Loy

Trinity for Assembly

OasesVelvet for Assembly

Running the Assembly Program

bull Two most important parameters for assembly ndash K-mers (word length) length of sequence

fragments used for joining

ndash C - coverage cut-off

bull Different combinations of K and C will result in assembly of different contigs

bull Multiple K and C should be tested for best results (Liu et al PLoS One 20127(9)e45161 doi

101371journalpone)

Multiple K Test for Assembly of AGV using OasesVelvet

(Here) read = contig Green + strand Red - strand

bull Annotation of contigs

-search for viral genes using BLASTx or BLASTn

bull BLAST against NCBI database

bull BLAST using your own databases

bull Blast2GO platform

-annotation of contigs

-motif search

-analysis of annotation data

Data Analysis How do we find viral sequences

Data Analysis Analyzing virus-derived contigs

bull Extract BLAST data (sequences with virus as top hit)

bull Organize contigs that hit the same or similar viruses

bull Join contigs into viral genome

bull Design primers for PCRRT-PCR to fill sequence gaps

bull Sequence to confirm in silico cloning result

bull 5rsquo and 3rsquo RACE to identify end sequences

Working with Viral Contigs

viral gene == virus

7815 8536

8874 9193

9476 9737

8523 8730

9495 9307

501 2315

2656 4084

2295 2679

1 244

522 340

6337 5056

5079 4486

4508 4062

6377 6759

6839 7221

7539

7840 7538 7310

24

4

34

0

63

37

6

37

7

67

59

6

83

9

72

21

7

31

0

97

37

75

38

Trinity Assembly of APV2 (gt9800 nt) Assembled using sRNA isolated from pea aphid

7539

APV2-Acyrthosiphon pisum virus 2 (dicistrovrius)

+ strand

- strand

Summary

bull No single rule can be used to find a virus by NGS

bull Knowledge of virology can greatly help for analyzing NGS data

bull Manual alignment of virus derived sequences may be needed

bull Biological evidence is required for verifying true nature of viral sequences discovered by NGS

Acknowledgements

John K VanDyk Lyric Bartholomay Duan Loy

Collect samples that show disease symptoms

Isolate viruses

Observe virus particles

Identify viral genomes

Clone genomic DNARNA - sequence (Sanger sequencing) - assemble viral genome

Traditional Approach for Virus Discovery

Advantages of NGS for Virus Discovery

bull Many viruses are latent or asymptomatic

bull NGS can identify viral sequences without background information on viruses

bull Viral genomes are assembled de novo without reference sequences

bull NGS has revolutionized virus discovery

sgRNA ()

4795 nt CP

pRdRP

VPg

CP

(P145)

P 10

RTD

P28 RTD () P35

UAG

Aphis glycines virus (AGV) -assembled from transcriptome

Similar to tetraviruses

Structurally resemble luteoviruses (plant virus)

A new insect virus with tetravirus-like RdRp and plant virus-like capsid protein

Outline

bull Introduction Why use NGS ndash Traditional approach for virus discovery ndash Next Generation Sequencing (NGS) ndash Advantages of NGS for virus discovery

bull How itrsquos done ndash Sample selection ndash Sequencing library preparation ndash Sequencing method ndash Assembly of sequencing reads ndash Identification of viral sequence ndash Assembly of viral genome

Sample Selection

bull Small sample size (10 ug or less RNA adequate)

-but the more the better

bull Tissue vs whole organism

-sequencing depth

bull Virus purification

-helps to identify full-length sequence

-better approach for DNA viruses

Sequencing Technologies

o Short reads (35-250 nt)

1 Genome Analyzer IIx (GAIIx) HiSeq2000 HiSeq2500 MiSeq ndash Illumina

(Hiseq2000 capable of up to 600Gb per run)

1 SOLiD 5500xl System ndash Applied Biosystems

2 HeliScopetrade Single Molecule Sequencer - Helicos

o Long reads (400-20000 nt)

1 Genome Sequencer FLX System (454) ndash Roche

2 PacBio RS - Pacific Bioscience

3 Personal Genome Machine Ion Proton - Ion Torrent

4 GridION ndash Oxford Nanopore

Preparation of Sequencing Library

Library type Viral genomes Sequence recovery

mRNA DNARNA +++ possible full-length

Small RNA DNARNA +++

DNA DNA +++ possible full-length

DNA or RNA isolated from viruses

DNARNA +++++ full-length

mRNA purification may result in loss of sequences for viruses that lack polyA tails

AGV assembled from different sequencing samples

RNA isolated from gut

RNA isolated from whole aphid with 2 rounds polyA purification

RNA isolated from whole aphid with 1 round polyA purification

Green + strand Red - strand

Assembly

DNARNA contigs

RNADNAsmall RNA Reads

Host Genome

Known viruses

New viruses in known genera

Complete viral genomes

New viruses in new genera

Nucleotide database By BlastN

Protein database By BlastX

PCR RT-PCR

NGS for Virus Discovery

Modified from Ding amp Lu 2011 Curr Opin Virol 1533-544

Assembly of Sequencing Reads -pre-processing of sequence data

bull Remove potential adaptor index sequences

bull Check sequencing quality

ndash Quality score GC content

ndash Read length distribution

ndash Overrepresented sequences

ndash etc

bull If necessary trim bases with low quality

Trimming of Bases with Low QS

Trimming of bases with low quality scores may result in loss of viral sequences

NOTE The near full-length genome of AGV was assembled from an untrimmed data set with poor quality scores The genome could not be assembled from the data set following standard trimming

Software for Checking Sequence Quality-FastQC

Sequence Count Percentage Possible Source

AGATCGGAAGAG

CACACGTCTGAAC

TCCAGTCACCTTG

TAATCTCGTATG

1968861 220

TruSeq Adapter

Index 12 (100

over 49bp)

Overrepresented Sequences

Sequence Count Percentage Possible Source

CAGATTTCGGGCTAAAGGGAATACGGTTAAAATC

CCGTGACCTGCCCTGT 51018488 4090 No Hit

TCAGATTTCGGGCTAAAGGGAATACGGTTAAAATC

CCGTGACCTGCCCTG 24264170 1945 No Hit

The seqeunces were derived from Penaeus vannamei 18S ribosomal RNA -cotaminated in sRNA

Software for manipulating sequencing data

CLC Genomics Workbench (US$5000 per copy gtUS$1000per year for update)

Assembly of Sequencing Reads

bull de novo assembly or mapping (alignment) -de novo assembly searching for new viruses no reference is needed -mapping re-sequencing SNP isolate need reference sequences (MARA GATK and other toolkits) bull de novo assembly may provide extra information about

known viral sequences Shrimp virus Infectious myonecrosis virus (IMNV a dsRNA virus) - documented seq 7560 bp (Poulos et al JGV 2006 87 987-996)

- de novo assembled from RNA-seq 8233 bp RT-PCR proved IMNV should have at least 8233 bp

Thursday 945 am 168 Virus 4 Duan Loy

Trinity for Assembly

OasesVelvet for Assembly

Running the Assembly Program

bull Two most important parameters for assembly ndash K-mers (word length) length of sequence

fragments used for joining

ndash C - coverage cut-off

bull Different combinations of K and C will result in assembly of different contigs

bull Multiple K and C should be tested for best results (Liu et al PLoS One 20127(9)e45161 doi

101371journalpone)

Multiple K Test for Assembly of AGV using OasesVelvet

(Here) read = contig Green + strand Red - strand

bull Annotation of contigs

-search for viral genes using BLASTx or BLASTn

bull BLAST against NCBI database

bull BLAST using your own databases

bull Blast2GO platform

-annotation of contigs

-motif search

-analysis of annotation data

Data Analysis How do we find viral sequences

Data Analysis Analyzing virus-derived contigs

bull Extract BLAST data (sequences with virus as top hit)

bull Organize contigs that hit the same or similar viruses

bull Join contigs into viral genome

bull Design primers for PCRRT-PCR to fill sequence gaps

bull Sequence to confirm in silico cloning result

bull 5rsquo and 3rsquo RACE to identify end sequences

Working with Viral Contigs

viral gene == virus

7815 8536

8874 9193

9476 9737

8523 8730

9495 9307

501 2315

2656 4084

2295 2679

1 244

522 340

6337 5056

5079 4486

4508 4062

6377 6759

6839 7221

7539

7840 7538 7310

24

4

34

0

63

37

6

37

7

67

59

6

83

9

72

21

7

31

0

97

37

75

38

Trinity Assembly of APV2 (gt9800 nt) Assembled using sRNA isolated from pea aphid

7539

APV2-Acyrthosiphon pisum virus 2 (dicistrovrius)

+ strand

- strand

Summary

bull No single rule can be used to find a virus by NGS

bull Knowledge of virology can greatly help for analyzing NGS data

bull Manual alignment of virus derived sequences may be needed

bull Biological evidence is required for verifying true nature of viral sequences discovered by NGS

Acknowledgements

John K VanDyk Lyric Bartholomay Duan Loy

Advantages of NGS for Virus Discovery

bull Many viruses are latent or asymptomatic

bull NGS can identify viral sequences without background information on viruses

bull Viral genomes are assembled de novo without reference sequences

bull NGS has revolutionized virus discovery

sgRNA ()

4795 nt CP

pRdRP

VPg

CP

(P145)

P 10

RTD

P28 RTD () P35

UAG

Aphis glycines virus (AGV) -assembled from transcriptome

Similar to tetraviruses

Structurally resemble luteoviruses (plant virus)

A new insect virus with tetravirus-like RdRp and plant virus-like capsid protein

Outline

bull Introduction Why use NGS ndash Traditional approach for virus discovery ndash Next Generation Sequencing (NGS) ndash Advantages of NGS for virus discovery

bull How itrsquos done ndash Sample selection ndash Sequencing library preparation ndash Sequencing method ndash Assembly of sequencing reads ndash Identification of viral sequence ndash Assembly of viral genome

Sample Selection

bull Small sample size (10 ug or less RNA adequate)

-but the more the better

bull Tissue vs whole organism

-sequencing depth

bull Virus purification

-helps to identify full-length sequence

-better approach for DNA viruses

Sequencing Technologies

o Short reads (35-250 nt)

1 Genome Analyzer IIx (GAIIx) HiSeq2000 HiSeq2500 MiSeq ndash Illumina

(Hiseq2000 capable of up to 600Gb per run)

1 SOLiD 5500xl System ndash Applied Biosystems

2 HeliScopetrade Single Molecule Sequencer - Helicos

o Long reads (400-20000 nt)

1 Genome Sequencer FLX System (454) ndash Roche

2 PacBio RS - Pacific Bioscience

3 Personal Genome Machine Ion Proton - Ion Torrent

4 GridION ndash Oxford Nanopore

Preparation of Sequencing Library

Library type Viral genomes Sequence recovery

mRNA DNARNA +++ possible full-length

Small RNA DNARNA +++

DNA DNA +++ possible full-length

DNA or RNA isolated from viruses

DNARNA +++++ full-length

mRNA purification may result in loss of sequences for viruses that lack polyA tails

AGV assembled from different sequencing samples

RNA isolated from gut

RNA isolated from whole aphid with 2 rounds polyA purification

RNA isolated from whole aphid with 1 round polyA purification

Green + strand Red - strand

Assembly

DNARNA contigs

RNADNAsmall RNA Reads

Host Genome

Known viruses

New viruses in known genera

Complete viral genomes

New viruses in new genera

Nucleotide database By BlastN

Protein database By BlastX

PCR RT-PCR

NGS for Virus Discovery

Modified from Ding amp Lu 2011 Curr Opin Virol 1533-544

Assembly of Sequencing Reads -pre-processing of sequence data

bull Remove potential adaptor index sequences

bull Check sequencing quality

ndash Quality score GC content

ndash Read length distribution

ndash Overrepresented sequences

ndash etc

bull If necessary trim bases with low quality

Trimming of Bases with Low QS

Trimming of bases with low quality scores may result in loss of viral sequences

NOTE The near full-length genome of AGV was assembled from an untrimmed data set with poor quality scores The genome could not be assembled from the data set following standard trimming

Software for Checking Sequence Quality-FastQC

Sequence Count Percentage Possible Source

AGATCGGAAGAG

CACACGTCTGAAC

TCCAGTCACCTTG

TAATCTCGTATG

1968861 220

TruSeq Adapter

Index 12 (100

over 49bp)

Overrepresented Sequences

Sequence Count Percentage Possible Source

CAGATTTCGGGCTAAAGGGAATACGGTTAAAATC

CCGTGACCTGCCCTGT 51018488 4090 No Hit

TCAGATTTCGGGCTAAAGGGAATACGGTTAAAATC

CCGTGACCTGCCCTG 24264170 1945 No Hit

The seqeunces were derived from Penaeus vannamei 18S ribosomal RNA -cotaminated in sRNA

Software for manipulating sequencing data

CLC Genomics Workbench (US$5000 per copy gtUS$1000per year for update)

Assembly of Sequencing Reads

bull de novo assembly or mapping (alignment) -de novo assembly searching for new viruses no reference is needed -mapping re-sequencing SNP isolate need reference sequences (MARA GATK and other toolkits) bull de novo assembly may provide extra information about

known viral sequences Shrimp virus Infectious myonecrosis virus (IMNV a dsRNA virus) - documented seq 7560 bp (Poulos et al JGV 2006 87 987-996)

- de novo assembled from RNA-seq 8233 bp RT-PCR proved IMNV should have at least 8233 bp

Thursday 945 am 168 Virus 4 Duan Loy

Trinity for Assembly

OasesVelvet for Assembly

Running the Assembly Program

bull Two most important parameters for assembly ndash K-mers (word length) length of sequence

fragments used for joining

ndash C - coverage cut-off

bull Different combinations of K and C will result in assembly of different contigs

bull Multiple K and C should be tested for best results (Liu et al PLoS One 20127(9)e45161 doi

101371journalpone)

Multiple K Test for Assembly of AGV using OasesVelvet

(Here) read = contig Green + strand Red - strand

bull Annotation of contigs

-search for viral genes using BLASTx or BLASTn

bull BLAST against NCBI database

bull BLAST using your own databases

bull Blast2GO platform

-annotation of contigs

-motif search

-analysis of annotation data

Data Analysis How do we find viral sequences

Data Analysis Analyzing virus-derived contigs

bull Extract BLAST data (sequences with virus as top hit)

bull Organize contigs that hit the same or similar viruses

bull Join contigs into viral genome

bull Design primers for PCRRT-PCR to fill sequence gaps

bull Sequence to confirm in silico cloning result

bull 5rsquo and 3rsquo RACE to identify end sequences

Working with Viral Contigs

viral gene == virus

7815 8536

8874 9193

9476 9737

8523 8730

9495 9307

501 2315

2656 4084

2295 2679

1 244

522 340

6337 5056

5079 4486

4508 4062

6377 6759

6839 7221

7539

7840 7538 7310

24

4

34

0

63

37

6

37

7

67

59

6

83

9

72

21

7

31

0

97

37

75

38

Trinity Assembly of APV2 (gt9800 nt) Assembled using sRNA isolated from pea aphid

7539

APV2-Acyrthosiphon pisum virus 2 (dicistrovrius)

+ strand

- strand

Summary

bull No single rule can be used to find a virus by NGS

bull Knowledge of virology can greatly help for analyzing NGS data

bull Manual alignment of virus derived sequences may be needed

bull Biological evidence is required for verifying true nature of viral sequences discovered by NGS

Acknowledgements

John K VanDyk Lyric Bartholomay Duan Loy

sgRNA ()

4795 nt CP

pRdRP

VPg

CP

(P145)

P 10

RTD

P28 RTD () P35

UAG

Aphis glycines virus (AGV) -assembled from transcriptome

Similar to tetraviruses

Structurally resemble luteoviruses (plant virus)

A new insect virus with tetravirus-like RdRp and plant virus-like capsid protein

Outline

bull Introduction Why use NGS ndash Traditional approach for virus discovery ndash Next Generation Sequencing (NGS) ndash Advantages of NGS for virus discovery

bull How itrsquos done ndash Sample selection ndash Sequencing library preparation ndash Sequencing method ndash Assembly of sequencing reads ndash Identification of viral sequence ndash Assembly of viral genome

Sample Selection

bull Small sample size (10 ug or less RNA adequate)

-but the more the better

bull Tissue vs whole organism

-sequencing depth

bull Virus purification

-helps to identify full-length sequence

-better approach for DNA viruses

Sequencing Technologies

o Short reads (35-250 nt)

1 Genome Analyzer IIx (GAIIx) HiSeq2000 HiSeq2500 MiSeq ndash Illumina

(Hiseq2000 capable of up to 600Gb per run)

1 SOLiD 5500xl System ndash Applied Biosystems

2 HeliScopetrade Single Molecule Sequencer - Helicos

o Long reads (400-20000 nt)

1 Genome Sequencer FLX System (454) ndash Roche

2 PacBio RS - Pacific Bioscience

3 Personal Genome Machine Ion Proton - Ion Torrent

4 GridION ndash Oxford Nanopore

Preparation of Sequencing Library

Library type Viral genomes Sequence recovery

mRNA DNARNA +++ possible full-length

Small RNA DNARNA +++

DNA DNA +++ possible full-length

DNA or RNA isolated from viruses

DNARNA +++++ full-length

mRNA purification may result in loss of sequences for viruses that lack polyA tails

AGV assembled from different sequencing samples

RNA isolated from gut

RNA isolated from whole aphid with 2 rounds polyA purification

RNA isolated from whole aphid with 1 round polyA purification

Green + strand Red - strand

Assembly

DNARNA contigs

RNADNAsmall RNA Reads

Host Genome

Known viruses

New viruses in known genera

Complete viral genomes

New viruses in new genera

Nucleotide database By BlastN

Protein database By BlastX

PCR RT-PCR

NGS for Virus Discovery

Modified from Ding amp Lu 2011 Curr Opin Virol 1533-544

Assembly of Sequencing Reads -pre-processing of sequence data

bull Remove potential adaptor index sequences

bull Check sequencing quality

ndash Quality score GC content

ndash Read length distribution

ndash Overrepresented sequences

ndash etc

bull If necessary trim bases with low quality

Trimming of Bases with Low QS

Trimming of bases with low quality scores may result in loss of viral sequences

NOTE The near full-length genome of AGV was assembled from an untrimmed data set with poor quality scores The genome could not be assembled from the data set following standard trimming

Software for Checking Sequence Quality-FastQC

Sequence Count Percentage Possible Source

AGATCGGAAGAG

CACACGTCTGAAC

TCCAGTCACCTTG

TAATCTCGTATG

1968861 220

TruSeq Adapter

Index 12 (100

over 49bp)

Overrepresented Sequences

Sequence Count Percentage Possible Source

CAGATTTCGGGCTAAAGGGAATACGGTTAAAATC

CCGTGACCTGCCCTGT 51018488 4090 No Hit

TCAGATTTCGGGCTAAAGGGAATACGGTTAAAATC

CCGTGACCTGCCCTG 24264170 1945 No Hit

The seqeunces were derived from Penaeus vannamei 18S ribosomal RNA -cotaminated in sRNA

Software for manipulating sequencing data

CLC Genomics Workbench (US$5000 per copy gtUS$1000per year for update)

Assembly of Sequencing Reads

bull de novo assembly or mapping (alignment) -de novo assembly searching for new viruses no reference is needed -mapping re-sequencing SNP isolate need reference sequences (MARA GATK and other toolkits) bull de novo assembly may provide extra information about

known viral sequences Shrimp virus Infectious myonecrosis virus (IMNV a dsRNA virus) - documented seq 7560 bp (Poulos et al JGV 2006 87 987-996)

- de novo assembled from RNA-seq 8233 bp RT-PCR proved IMNV should have at least 8233 bp

Thursday 945 am 168 Virus 4 Duan Loy

Trinity for Assembly

OasesVelvet for Assembly

Running the Assembly Program

bull Two most important parameters for assembly ndash K-mers (word length) length of sequence

fragments used for joining

ndash C - coverage cut-off

bull Different combinations of K and C will result in assembly of different contigs

bull Multiple K and C should be tested for best results (Liu et al PLoS One 20127(9)e45161 doi

101371journalpone)

Multiple K Test for Assembly of AGV using OasesVelvet

(Here) read = contig Green + strand Red - strand

bull Annotation of contigs

-search for viral genes using BLASTx or BLASTn

bull BLAST against NCBI database

bull BLAST using your own databases

bull Blast2GO platform

-annotation of contigs

-motif search

-analysis of annotation data

Data Analysis How do we find viral sequences

Data Analysis Analyzing virus-derived contigs

bull Extract BLAST data (sequences with virus as top hit)

bull Organize contigs that hit the same or similar viruses

bull Join contigs into viral genome

bull Design primers for PCRRT-PCR to fill sequence gaps

bull Sequence to confirm in silico cloning result

bull 5rsquo and 3rsquo RACE to identify end sequences

Working with Viral Contigs

viral gene == virus

7815 8536

8874 9193

9476 9737

8523 8730

9495 9307

501 2315

2656 4084

2295 2679

1 244

522 340

6337 5056

5079 4486

4508 4062

6377 6759

6839 7221

7539

7840 7538 7310

24

4

34

0

63

37

6

37

7

67

59

6

83

9

72

21

7

31

0

97

37

75

38

Trinity Assembly of APV2 (gt9800 nt) Assembled using sRNA isolated from pea aphid

7539

APV2-Acyrthosiphon pisum virus 2 (dicistrovrius)

+ strand

- strand

Summary

bull No single rule can be used to find a virus by NGS

bull Knowledge of virology can greatly help for analyzing NGS data

bull Manual alignment of virus derived sequences may be needed

bull Biological evidence is required for verifying true nature of viral sequences discovered by NGS

Acknowledgements

John K VanDyk Lyric Bartholomay Duan Loy

Outline

bull Introduction Why use NGS ndash Traditional approach for virus discovery ndash Next Generation Sequencing (NGS) ndash Advantages of NGS for virus discovery

bull How itrsquos done ndash Sample selection ndash Sequencing library preparation ndash Sequencing method ndash Assembly of sequencing reads ndash Identification of viral sequence ndash Assembly of viral genome

Sample Selection

bull Small sample size (10 ug or less RNA adequate)

-but the more the better

bull Tissue vs whole organism

-sequencing depth

bull Virus purification

-helps to identify full-length sequence

-better approach for DNA viruses

Sequencing Technologies

o Short reads (35-250 nt)

1 Genome Analyzer IIx (GAIIx) HiSeq2000 HiSeq2500 MiSeq ndash Illumina

(Hiseq2000 capable of up to 600Gb per run)

1 SOLiD 5500xl System ndash Applied Biosystems

2 HeliScopetrade Single Molecule Sequencer - Helicos

o Long reads (400-20000 nt)

1 Genome Sequencer FLX System (454) ndash Roche

2 PacBio RS - Pacific Bioscience

3 Personal Genome Machine Ion Proton - Ion Torrent

4 GridION ndash Oxford Nanopore

Preparation of Sequencing Library

Library type Viral genomes Sequence recovery

mRNA DNARNA +++ possible full-length

Small RNA DNARNA +++

DNA DNA +++ possible full-length

DNA or RNA isolated from viruses

DNARNA +++++ full-length

mRNA purification may result in loss of sequences for viruses that lack polyA tails

AGV assembled from different sequencing samples

RNA isolated from gut

RNA isolated from whole aphid with 2 rounds polyA purification

RNA isolated from whole aphid with 1 round polyA purification

Green + strand Red - strand

Assembly

DNARNA contigs

RNADNAsmall RNA Reads

Host Genome

Known viruses

New viruses in known genera

Complete viral genomes

New viruses in new genera

Nucleotide database By BlastN

Protein database By BlastX

PCR RT-PCR

NGS for Virus Discovery

Modified from Ding amp Lu 2011 Curr Opin Virol 1533-544

Assembly of Sequencing Reads -pre-processing of sequence data

bull Remove potential adaptor index sequences

bull Check sequencing quality

ndash Quality score GC content

ndash Read length distribution

ndash Overrepresented sequences

ndash etc

bull If necessary trim bases with low quality

Trimming of Bases with Low QS

Trimming of bases with low quality scores may result in loss of viral sequences

NOTE The near full-length genome of AGV was assembled from an untrimmed data set with poor quality scores The genome could not be assembled from the data set following standard trimming

Software for Checking Sequence Quality-FastQC

Sequence Count Percentage Possible Source

AGATCGGAAGAG

CACACGTCTGAAC

TCCAGTCACCTTG

TAATCTCGTATG

1968861 220

TruSeq Adapter

Index 12 (100

over 49bp)

Overrepresented Sequences

Sequence Count Percentage Possible Source

CAGATTTCGGGCTAAAGGGAATACGGTTAAAATC

CCGTGACCTGCCCTGT 51018488 4090 No Hit

TCAGATTTCGGGCTAAAGGGAATACGGTTAAAATC

CCGTGACCTGCCCTG 24264170 1945 No Hit

The seqeunces were derived from Penaeus vannamei 18S ribosomal RNA -cotaminated in sRNA

Software for manipulating sequencing data

CLC Genomics Workbench (US$5000 per copy gtUS$1000per year for update)

Assembly of Sequencing Reads

bull de novo assembly or mapping (alignment) -de novo assembly searching for new viruses no reference is needed -mapping re-sequencing SNP isolate need reference sequences (MARA GATK and other toolkits) bull de novo assembly may provide extra information about

known viral sequences Shrimp virus Infectious myonecrosis virus (IMNV a dsRNA virus) - documented seq 7560 bp (Poulos et al JGV 2006 87 987-996)

- de novo assembled from RNA-seq 8233 bp RT-PCR proved IMNV should have at least 8233 bp

Thursday 945 am 168 Virus 4 Duan Loy

Trinity for Assembly

OasesVelvet for Assembly

Running the Assembly Program

bull Two most important parameters for assembly ndash K-mers (word length) length of sequence

fragments used for joining

ndash C - coverage cut-off

bull Different combinations of K and C will result in assembly of different contigs

bull Multiple K and C should be tested for best results (Liu et al PLoS One 20127(9)e45161 doi

101371journalpone)

Multiple K Test for Assembly of AGV using OasesVelvet

(Here) read = contig Green + strand Red - strand

bull Annotation of contigs

-search for viral genes using BLASTx or BLASTn

bull BLAST against NCBI database

bull BLAST using your own databases

bull Blast2GO platform

-annotation of contigs

-motif search

-analysis of annotation data

Data Analysis How do we find viral sequences

Data Analysis Analyzing virus-derived contigs

bull Extract BLAST data (sequences with virus as top hit)

bull Organize contigs that hit the same or similar viruses

bull Join contigs into viral genome

bull Design primers for PCRRT-PCR to fill sequence gaps

bull Sequence to confirm in silico cloning result

bull 5rsquo and 3rsquo RACE to identify end sequences

Working with Viral Contigs

viral gene == virus

7815 8536

8874 9193

9476 9737

8523 8730

9495 9307

501 2315

2656 4084

2295 2679

1 244

522 340

6337 5056

5079 4486

4508 4062

6377 6759

6839 7221

7539

7840 7538 7310

24

4

34

0

63

37

6

37

7

67

59

6

83

9

72

21

7

31

0

97

37

75

38

Trinity Assembly of APV2 (gt9800 nt) Assembled using sRNA isolated from pea aphid

7539

APV2-Acyrthosiphon pisum virus 2 (dicistrovrius)

+ strand

- strand

Summary

bull No single rule can be used to find a virus by NGS

bull Knowledge of virology can greatly help for analyzing NGS data

bull Manual alignment of virus derived sequences may be needed

bull Biological evidence is required for verifying true nature of viral sequences discovered by NGS

Acknowledgements

John K VanDyk Lyric Bartholomay Duan Loy

Sample Selection

bull Small sample size (10 ug or less RNA adequate)

-but the more the better

bull Tissue vs whole organism

-sequencing depth

bull Virus purification

-helps to identify full-length sequence

-better approach for DNA viruses

Sequencing Technologies

o Short reads (35-250 nt)

1 Genome Analyzer IIx (GAIIx) HiSeq2000 HiSeq2500 MiSeq ndash Illumina

(Hiseq2000 capable of up to 600Gb per run)

1 SOLiD 5500xl System ndash Applied Biosystems

2 HeliScopetrade Single Molecule Sequencer - Helicos

o Long reads (400-20000 nt)

1 Genome Sequencer FLX System (454) ndash Roche

2 PacBio RS - Pacific Bioscience

3 Personal Genome Machine Ion Proton - Ion Torrent

4 GridION ndash Oxford Nanopore

Preparation of Sequencing Library

Library type Viral genomes Sequence recovery

mRNA DNARNA +++ possible full-length

Small RNA DNARNA +++

DNA DNA +++ possible full-length

DNA or RNA isolated from viruses

DNARNA +++++ full-length

mRNA purification may result in loss of sequences for viruses that lack polyA tails

AGV assembled from different sequencing samples

RNA isolated from gut

RNA isolated from whole aphid with 2 rounds polyA purification

RNA isolated from whole aphid with 1 round polyA purification

Green + strand Red - strand

Assembly

DNARNA contigs

RNADNAsmall RNA Reads

Host Genome

Known viruses

New viruses in known genera

Complete viral genomes

New viruses in new genera

Nucleotide database By BlastN

Protein database By BlastX

PCR RT-PCR

NGS for Virus Discovery

Modified from Ding amp Lu 2011 Curr Opin Virol 1533-544

Assembly of Sequencing Reads -pre-processing of sequence data

bull Remove potential adaptor index sequences

bull Check sequencing quality

ndash Quality score GC content

ndash Read length distribution

ndash Overrepresented sequences

ndash etc

bull If necessary trim bases with low quality

Trimming of Bases with Low QS

Trimming of bases with low quality scores may result in loss of viral sequences

NOTE The near full-length genome of AGV was assembled from an untrimmed data set with poor quality scores The genome could not be assembled from the data set following standard trimming

Software for Checking Sequence Quality-FastQC

Sequence Count Percentage Possible Source

AGATCGGAAGAG

CACACGTCTGAAC

TCCAGTCACCTTG

TAATCTCGTATG

1968861 220

TruSeq Adapter

Index 12 (100

over 49bp)

Overrepresented Sequences

Sequence Count Percentage Possible Source

CAGATTTCGGGCTAAAGGGAATACGGTTAAAATC

CCGTGACCTGCCCTGT 51018488 4090 No Hit

TCAGATTTCGGGCTAAAGGGAATACGGTTAAAATC

CCGTGACCTGCCCTG 24264170 1945 No Hit

The seqeunces were derived from Penaeus vannamei 18S ribosomal RNA -cotaminated in sRNA

Software for manipulating sequencing data

CLC Genomics Workbench (US$5000 per copy gtUS$1000per year for update)

Assembly of Sequencing Reads

bull de novo assembly or mapping (alignment) -de novo assembly searching for new viruses no reference is needed -mapping re-sequencing SNP isolate need reference sequences (MARA GATK and other toolkits) bull de novo assembly may provide extra information about

known viral sequences Shrimp virus Infectious myonecrosis virus (IMNV a dsRNA virus) - documented seq 7560 bp (Poulos et al JGV 2006 87 987-996)

- de novo assembled from RNA-seq 8233 bp RT-PCR proved IMNV should have at least 8233 bp

Thursday 945 am 168 Virus 4 Duan Loy

Trinity for Assembly

OasesVelvet for Assembly

Running the Assembly Program

bull Two most important parameters for assembly ndash K-mers (word length) length of sequence

fragments used for joining

ndash C - coverage cut-off

bull Different combinations of K and C will result in assembly of different contigs

bull Multiple K and C should be tested for best results (Liu et al PLoS One 20127(9)e45161 doi

101371journalpone)

Multiple K Test for Assembly of AGV using OasesVelvet

(Here) read = contig Green + strand Red - strand

bull Annotation of contigs

-search for viral genes using BLASTx or BLASTn

bull BLAST against NCBI database

bull BLAST using your own databases

bull Blast2GO platform

-annotation of contigs

-motif search

-analysis of annotation data

Data Analysis How do we find viral sequences

Data Analysis Analyzing virus-derived contigs

bull Extract BLAST data (sequences with virus as top hit)

bull Organize contigs that hit the same or similar viruses

bull Join contigs into viral genome

bull Design primers for PCRRT-PCR to fill sequence gaps

bull Sequence to confirm in silico cloning result

bull 5rsquo and 3rsquo RACE to identify end sequences

Working with Viral Contigs

viral gene == virus

7815 8536

8874 9193

9476 9737

8523 8730

9495 9307

501 2315

2656 4084

2295 2679

1 244

522 340

6337 5056

5079 4486

4508 4062

6377 6759

6839 7221

7539

7840 7538 7310

24

4

34

0

63

37

6

37

7

67

59

6

83

9

72

21

7

31

0

97

37

75

38

Trinity Assembly of APV2 (gt9800 nt) Assembled using sRNA isolated from pea aphid

7539

APV2-Acyrthosiphon pisum virus 2 (dicistrovrius)

+ strand

- strand

Summary

bull No single rule can be used to find a virus by NGS

bull Knowledge of virology can greatly help for analyzing NGS data

bull Manual alignment of virus derived sequences may be needed

bull Biological evidence is required for verifying true nature of viral sequences discovered by NGS

Acknowledgements

John K VanDyk Lyric Bartholomay Duan Loy

Sequencing Technologies

o Short reads (35-250 nt)

1 Genome Analyzer IIx (GAIIx) HiSeq2000 HiSeq2500 MiSeq ndash Illumina

(Hiseq2000 capable of up to 600Gb per run)

1 SOLiD 5500xl System ndash Applied Biosystems

2 HeliScopetrade Single Molecule Sequencer - Helicos

o Long reads (400-20000 nt)

1 Genome Sequencer FLX System (454) ndash Roche

2 PacBio RS - Pacific Bioscience

3 Personal Genome Machine Ion Proton - Ion Torrent

4 GridION ndash Oxford Nanopore

Preparation of Sequencing Library

Library type Viral genomes Sequence recovery

mRNA DNARNA +++ possible full-length

Small RNA DNARNA +++

DNA DNA +++ possible full-length

DNA or RNA isolated from viruses

DNARNA +++++ full-length

mRNA purification may result in loss of sequences for viruses that lack polyA tails

AGV assembled from different sequencing samples

RNA isolated from gut

RNA isolated from whole aphid with 2 rounds polyA purification

RNA isolated from whole aphid with 1 round polyA purification

Green + strand Red - strand

Assembly

DNARNA contigs

RNADNAsmall RNA Reads

Host Genome

Known viruses

New viruses in known genera

Complete viral genomes

New viruses in new genera

Nucleotide database By BlastN

Protein database By BlastX

PCR RT-PCR

NGS for Virus Discovery

Modified from Ding amp Lu 2011 Curr Opin Virol 1533-544

Assembly of Sequencing Reads -pre-processing of sequence data

bull Remove potential adaptor index sequences

bull Check sequencing quality

ndash Quality score GC content

ndash Read length distribution

ndash Overrepresented sequences

ndash etc

bull If necessary trim bases with low quality

Trimming of Bases with Low QS

Trimming of bases with low quality scores may result in loss of viral sequences

NOTE The near full-length genome of AGV was assembled from an untrimmed data set with poor quality scores The genome could not be assembled from the data set following standard trimming

Software for Checking Sequence Quality-FastQC

Sequence Count Percentage Possible Source

AGATCGGAAGAG

CACACGTCTGAAC

TCCAGTCACCTTG

TAATCTCGTATG

1968861 220

TruSeq Adapter

Index 12 (100

over 49bp)

Overrepresented Sequences

Sequence Count Percentage Possible Source

CAGATTTCGGGCTAAAGGGAATACGGTTAAAATC

CCGTGACCTGCCCTGT 51018488 4090 No Hit

TCAGATTTCGGGCTAAAGGGAATACGGTTAAAATC

CCGTGACCTGCCCTG 24264170 1945 No Hit

The seqeunces were derived from Penaeus vannamei 18S ribosomal RNA -cotaminated in sRNA

Software for manipulating sequencing data

CLC Genomics Workbench (US$5000 per copy gtUS$1000per year for update)

Assembly of Sequencing Reads

bull de novo assembly or mapping (alignment) -de novo assembly searching for new viruses no reference is needed -mapping re-sequencing SNP isolate need reference sequences (MARA GATK and other toolkits) bull de novo assembly may provide extra information about

known viral sequences Shrimp virus Infectious myonecrosis virus (IMNV a dsRNA virus) - documented seq 7560 bp (Poulos et al JGV 2006 87 987-996)

- de novo assembled from RNA-seq 8233 bp RT-PCR proved IMNV should have at least 8233 bp

Thursday 945 am 168 Virus 4 Duan Loy

Trinity for Assembly

OasesVelvet for Assembly

Running the Assembly Program

bull Two most important parameters for assembly ndash K-mers (word length) length of sequence

fragments used for joining

ndash C - coverage cut-off

bull Different combinations of K and C will result in assembly of different contigs

bull Multiple K and C should be tested for best results (Liu et al PLoS One 20127(9)e45161 doi

101371journalpone)

Multiple K Test for Assembly of AGV using OasesVelvet

(Here) read = contig Green + strand Red - strand

bull Annotation of contigs

-search for viral genes using BLASTx or BLASTn

bull BLAST against NCBI database

bull BLAST using your own databases

bull Blast2GO platform

-annotation of contigs

-motif search

-analysis of annotation data

Data Analysis How do we find viral sequences

Data Analysis Analyzing virus-derived contigs

bull Extract BLAST data (sequences with virus as top hit)

bull Organize contigs that hit the same or similar viruses

bull Join contigs into viral genome

bull Design primers for PCRRT-PCR to fill sequence gaps

bull Sequence to confirm in silico cloning result

bull 5rsquo and 3rsquo RACE to identify end sequences

Working with Viral Contigs

viral gene == virus

7815 8536

8874 9193

9476 9737

8523 8730

9495 9307

501 2315

2656 4084

2295 2679

1 244

522 340

6337 5056

5079 4486

4508 4062

6377 6759

6839 7221

7539

7840 7538 7310

24

4

34

0

63

37

6

37

7

67

59

6

83

9

72

21

7

31

0

97

37

75

38

Trinity Assembly of APV2 (gt9800 nt) Assembled using sRNA isolated from pea aphid

7539

APV2-Acyrthosiphon pisum virus 2 (dicistrovrius)

+ strand

- strand

Summary

bull No single rule can be used to find a virus by NGS

bull Knowledge of virology can greatly help for analyzing NGS data

bull Manual alignment of virus derived sequences may be needed

bull Biological evidence is required for verifying true nature of viral sequences discovered by NGS

Acknowledgements

John K VanDyk Lyric Bartholomay Duan Loy

Preparation of Sequencing Library

Library type Viral genomes Sequence recovery

mRNA DNARNA +++ possible full-length

Small RNA DNARNA +++

DNA DNA +++ possible full-length

DNA or RNA isolated from viruses

DNARNA +++++ full-length

mRNA purification may result in loss of sequences for viruses that lack polyA tails

AGV assembled from different sequencing samples

RNA isolated from gut

RNA isolated from whole aphid with 2 rounds polyA purification

RNA isolated from whole aphid with 1 round polyA purification

Green + strand Red - strand

Assembly

DNARNA contigs

RNADNAsmall RNA Reads

Host Genome

Known viruses

New viruses in known genera

Complete viral genomes

New viruses in new genera

Nucleotide database By BlastN

Protein database By BlastX

PCR RT-PCR

NGS for Virus Discovery

Modified from Ding amp Lu 2011 Curr Opin Virol 1533-544

Assembly of Sequencing Reads -pre-processing of sequence data

bull Remove potential adaptor index sequences

bull Check sequencing quality

ndash Quality score GC content

ndash Read length distribution

ndash Overrepresented sequences

ndash etc

bull If necessary trim bases with low quality

Trimming of Bases with Low QS

Trimming of bases with low quality scores may result in loss of viral sequences

NOTE The near full-length genome of AGV was assembled from an untrimmed data set with poor quality scores The genome could not be assembled from the data set following standard trimming

Software for Checking Sequence Quality-FastQC

Sequence Count Percentage Possible Source

AGATCGGAAGAG

CACACGTCTGAAC

TCCAGTCACCTTG

TAATCTCGTATG

1968861 220

TruSeq Adapter

Index 12 (100

over 49bp)

Overrepresented Sequences

Sequence Count Percentage Possible Source

CAGATTTCGGGCTAAAGGGAATACGGTTAAAATC

CCGTGACCTGCCCTGT 51018488 4090 No Hit

TCAGATTTCGGGCTAAAGGGAATACGGTTAAAATC

CCGTGACCTGCCCTG 24264170 1945 No Hit

The seqeunces were derived from Penaeus vannamei 18S ribosomal RNA -cotaminated in sRNA

Software for manipulating sequencing data

CLC Genomics Workbench (US$5000 per copy gtUS$1000per year for update)

Assembly of Sequencing Reads

bull de novo assembly or mapping (alignment) -de novo assembly searching for new viruses no reference is needed -mapping re-sequencing SNP isolate need reference sequences (MARA GATK and other toolkits) bull de novo assembly may provide extra information about

known viral sequences Shrimp virus Infectious myonecrosis virus (IMNV a dsRNA virus) - documented seq 7560 bp (Poulos et al JGV 2006 87 987-996)

- de novo assembled from RNA-seq 8233 bp RT-PCR proved IMNV should have at least 8233 bp

Thursday 945 am 168 Virus 4 Duan Loy

Trinity for Assembly

OasesVelvet for Assembly

Running the Assembly Program

bull Two most important parameters for assembly ndash K-mers (word length) length of sequence

fragments used for joining

ndash C - coverage cut-off

bull Different combinations of K and C will result in assembly of different contigs

bull Multiple K and C should be tested for best results (Liu et al PLoS One 20127(9)e45161 doi

101371journalpone)

Multiple K Test for Assembly of AGV using OasesVelvet

(Here) read = contig Green + strand Red - strand

bull Annotation of contigs

-search for viral genes using BLASTx or BLASTn

bull BLAST against NCBI database

bull BLAST using your own databases

bull Blast2GO platform

-annotation of contigs

-motif search

-analysis of annotation data

Data Analysis How do we find viral sequences

Data Analysis Analyzing virus-derived contigs

bull Extract BLAST data (sequences with virus as top hit)

bull Organize contigs that hit the same or similar viruses

bull Join contigs into viral genome

bull Design primers for PCRRT-PCR to fill sequence gaps

bull Sequence to confirm in silico cloning result

bull 5rsquo and 3rsquo RACE to identify end sequences

Working with Viral Contigs

viral gene == virus

7815 8536

8874 9193

9476 9737

8523 8730

9495 9307

501 2315

2656 4084

2295 2679

1 244

522 340

6337 5056

5079 4486

4508 4062

6377 6759

6839 7221

7539

7840 7538 7310

24

4

34

0

63

37

6

37

7

67

59

6

83

9

72

21

7

31

0

97

37

75

38

Trinity Assembly of APV2 (gt9800 nt) Assembled using sRNA isolated from pea aphid

7539

APV2-Acyrthosiphon pisum virus 2 (dicistrovrius)

+ strand

- strand

Summary

bull No single rule can be used to find a virus by NGS

bull Knowledge of virology can greatly help for analyzing NGS data

bull Manual alignment of virus derived sequences may be needed

bull Biological evidence is required for verifying true nature of viral sequences discovered by NGS

Acknowledgements

John K VanDyk Lyric Bartholomay Duan Loy

AGV assembled from different sequencing samples

RNA isolated from gut

RNA isolated from whole aphid with 2 rounds polyA purification

RNA isolated from whole aphid with 1 round polyA purification

Green + strand Red - strand

Assembly

DNARNA contigs

RNADNAsmall RNA Reads

Host Genome

Known viruses

New viruses in known genera

Complete viral genomes

New viruses in new genera

Nucleotide database By BlastN

Protein database By BlastX

PCR RT-PCR

NGS for Virus Discovery

Modified from Ding amp Lu 2011 Curr Opin Virol 1533-544

Assembly of Sequencing Reads -pre-processing of sequence data

bull Remove potential adaptor index sequences

bull Check sequencing quality

ndash Quality score GC content

ndash Read length distribution

ndash Overrepresented sequences

ndash etc

bull If necessary trim bases with low quality

Trimming of Bases with Low QS

Trimming of bases with low quality scores may result in loss of viral sequences

NOTE The near full-length genome of AGV was assembled from an untrimmed data set with poor quality scores The genome could not be assembled from the data set following standard trimming

Software for Checking Sequence Quality-FastQC

Sequence Count Percentage Possible Source

AGATCGGAAGAG

CACACGTCTGAAC

TCCAGTCACCTTG

TAATCTCGTATG

1968861 220

TruSeq Adapter

Index 12 (100

over 49bp)

Overrepresented Sequences

Sequence Count Percentage Possible Source

CAGATTTCGGGCTAAAGGGAATACGGTTAAAATC

CCGTGACCTGCCCTGT 51018488 4090 No Hit

TCAGATTTCGGGCTAAAGGGAATACGGTTAAAATC

CCGTGACCTGCCCTG 24264170 1945 No Hit

The seqeunces were derived from Penaeus vannamei 18S ribosomal RNA -cotaminated in sRNA

Software for manipulating sequencing data

CLC Genomics Workbench (US$5000 per copy gtUS$1000per year for update)

Assembly of Sequencing Reads

bull de novo assembly or mapping (alignment) -de novo assembly searching for new viruses no reference is needed -mapping re-sequencing SNP isolate need reference sequences (MARA GATK and other toolkits) bull de novo assembly may provide extra information about

known viral sequences Shrimp virus Infectious myonecrosis virus (IMNV a dsRNA virus) - documented seq 7560 bp (Poulos et al JGV 2006 87 987-996)

- de novo assembled from RNA-seq 8233 bp RT-PCR proved IMNV should have at least 8233 bp

Thursday 945 am 168 Virus 4 Duan Loy

Trinity for Assembly

OasesVelvet for Assembly

Running the Assembly Program

bull Two most important parameters for assembly ndash K-mers (word length) length of sequence

fragments used for joining

ndash C - coverage cut-off

bull Different combinations of K and C will result in assembly of different contigs

bull Multiple K and C should be tested for best results (Liu et al PLoS One 20127(9)e45161 doi

101371journalpone)

Multiple K Test for Assembly of AGV using OasesVelvet

(Here) read = contig Green + strand Red - strand

bull Annotation of contigs

-search for viral genes using BLASTx or BLASTn

bull BLAST against NCBI database

bull BLAST using your own databases

bull Blast2GO platform

-annotation of contigs

-motif search

-analysis of annotation data

Data Analysis How do we find viral sequences

Data Analysis Analyzing virus-derived contigs

bull Extract BLAST data (sequences with virus as top hit)

bull Organize contigs that hit the same or similar viruses

bull Join contigs into viral genome

bull Design primers for PCRRT-PCR to fill sequence gaps

bull Sequence to confirm in silico cloning result

bull 5rsquo and 3rsquo RACE to identify end sequences

Working with Viral Contigs

viral gene == virus

7815 8536

8874 9193

9476 9737

8523 8730

9495 9307

501 2315

2656 4084

2295 2679

1 244

522 340

6337 5056

5079 4486

4508 4062

6377 6759

6839 7221

7539

7840 7538 7310

24

4

34

0

63

37

6

37

7

67

59

6

83

9

72

21

7

31

0

97

37

75

38

Trinity Assembly of APV2 (gt9800 nt) Assembled using sRNA isolated from pea aphid

7539

APV2-Acyrthosiphon pisum virus 2 (dicistrovrius)

+ strand

- strand

Summary

bull No single rule can be used to find a virus by NGS

bull Knowledge of virology can greatly help for analyzing NGS data

bull Manual alignment of virus derived sequences may be needed

bull Biological evidence is required for verifying true nature of viral sequences discovered by NGS

Acknowledgements

John K VanDyk Lyric Bartholomay Duan Loy

Assembly

DNARNA contigs

RNADNAsmall RNA Reads

Host Genome

Known viruses

New viruses in known genera

Complete viral genomes

New viruses in new genera

Nucleotide database By BlastN

Protein database By BlastX

PCR RT-PCR

NGS for Virus Discovery

Modified from Ding amp Lu 2011 Curr Opin Virol 1533-544

Assembly of Sequencing Reads -pre-processing of sequence data

bull Remove potential adaptor index sequences

bull Check sequencing quality

ndash Quality score GC content

ndash Read length distribution

ndash Overrepresented sequences

ndash etc

bull If necessary trim bases with low quality

Trimming of Bases with Low QS

Trimming of bases with low quality scores may result in loss of viral sequences

NOTE The near full-length genome of AGV was assembled from an untrimmed data set with poor quality scores The genome could not be assembled from the data set following standard trimming

Software for Checking Sequence Quality-FastQC

Sequence Count Percentage Possible Source

AGATCGGAAGAG

CACACGTCTGAAC

TCCAGTCACCTTG

TAATCTCGTATG

1968861 220

TruSeq Adapter

Index 12 (100

over 49bp)

Overrepresented Sequences

Sequence Count Percentage Possible Source

CAGATTTCGGGCTAAAGGGAATACGGTTAAAATC

CCGTGACCTGCCCTGT 51018488 4090 No Hit

TCAGATTTCGGGCTAAAGGGAATACGGTTAAAATC

CCGTGACCTGCCCTG 24264170 1945 No Hit

The seqeunces were derived from Penaeus vannamei 18S ribosomal RNA -cotaminated in sRNA

Software for manipulating sequencing data

CLC Genomics Workbench (US$5000 per copy gtUS$1000per year for update)

Assembly of Sequencing Reads

bull de novo assembly or mapping (alignment) -de novo assembly searching for new viruses no reference is needed -mapping re-sequencing SNP isolate need reference sequences (MARA GATK and other toolkits) bull de novo assembly may provide extra information about

known viral sequences Shrimp virus Infectious myonecrosis virus (IMNV a dsRNA virus) - documented seq 7560 bp (Poulos et al JGV 2006 87 987-996)

- de novo assembled from RNA-seq 8233 bp RT-PCR proved IMNV should have at least 8233 bp

Thursday 945 am 168 Virus 4 Duan Loy

Trinity for Assembly

OasesVelvet for Assembly

Running the Assembly Program

bull Two most important parameters for assembly ndash K-mers (word length) length of sequence

fragments used for joining

ndash C - coverage cut-off

bull Different combinations of K and C will result in assembly of different contigs

bull Multiple K and C should be tested for best results (Liu et al PLoS One 20127(9)e45161 doi

101371journalpone)

Multiple K Test for Assembly of AGV using OasesVelvet

(Here) read = contig Green + strand Red - strand

bull Annotation of contigs

-search for viral genes using BLASTx or BLASTn

bull BLAST against NCBI database

bull BLAST using your own databases

bull Blast2GO platform

-annotation of contigs

-motif search

-analysis of annotation data

Data Analysis How do we find viral sequences

Data Analysis Analyzing virus-derived contigs

bull Extract BLAST data (sequences with virus as top hit)

bull Organize contigs that hit the same or similar viruses

bull Join contigs into viral genome

bull Design primers for PCRRT-PCR to fill sequence gaps

bull Sequence to confirm in silico cloning result

bull 5rsquo and 3rsquo RACE to identify end sequences

Working with Viral Contigs

viral gene == virus

7815 8536

8874 9193

9476 9737

8523 8730

9495 9307

501 2315

2656 4084

2295 2679

1 244

522 340

6337 5056

5079 4486

4508 4062

6377 6759

6839 7221

7539

7840 7538 7310

24

4

34

0

63

37

6

37

7

67

59

6

83

9

72

21

7

31

0

97

37

75

38

Trinity Assembly of APV2 (gt9800 nt) Assembled using sRNA isolated from pea aphid

7539

APV2-Acyrthosiphon pisum virus 2 (dicistrovrius)

+ strand

- strand

Summary

bull No single rule can be used to find a virus by NGS

bull Knowledge of virology can greatly help for analyzing NGS data

bull Manual alignment of virus derived sequences may be needed

bull Biological evidence is required for verifying true nature of viral sequences discovered by NGS

Acknowledgements

John K VanDyk Lyric Bartholomay Duan Loy

Assembly of Sequencing Reads -pre-processing of sequence data

bull Remove potential adaptor index sequences

bull Check sequencing quality

ndash Quality score GC content

ndash Read length distribution

ndash Overrepresented sequences

ndash etc

bull If necessary trim bases with low quality

Trimming of Bases with Low QS

Trimming of bases with low quality scores may result in loss of viral sequences

NOTE The near full-length genome of AGV was assembled from an untrimmed data set with poor quality scores The genome could not be assembled from the data set following standard trimming

Software for Checking Sequence Quality-FastQC

Sequence Count Percentage Possible Source

AGATCGGAAGAG

CACACGTCTGAAC

TCCAGTCACCTTG

TAATCTCGTATG

1968861 220

TruSeq Adapter

Index 12 (100

over 49bp)

Overrepresented Sequences

Sequence Count Percentage Possible Source

CAGATTTCGGGCTAAAGGGAATACGGTTAAAATC

CCGTGACCTGCCCTGT 51018488 4090 No Hit

TCAGATTTCGGGCTAAAGGGAATACGGTTAAAATC

CCGTGACCTGCCCTG 24264170 1945 No Hit

The seqeunces were derived from Penaeus vannamei 18S ribosomal RNA -cotaminated in sRNA

Software for manipulating sequencing data

CLC Genomics Workbench (US$5000 per copy gtUS$1000per year for update)

Assembly of Sequencing Reads

bull de novo assembly or mapping (alignment) -de novo assembly searching for new viruses no reference is needed -mapping re-sequencing SNP isolate need reference sequences (MARA GATK and other toolkits) bull de novo assembly may provide extra information about

known viral sequences Shrimp virus Infectious myonecrosis virus (IMNV a dsRNA virus) - documented seq 7560 bp (Poulos et al JGV 2006 87 987-996)

- de novo assembled from RNA-seq 8233 bp RT-PCR proved IMNV should have at least 8233 bp

Thursday 945 am 168 Virus 4 Duan Loy

Trinity for Assembly

OasesVelvet for Assembly

Running the Assembly Program

bull Two most important parameters for assembly ndash K-mers (word length) length of sequence

fragments used for joining

ndash C - coverage cut-off

bull Different combinations of K and C will result in assembly of different contigs

bull Multiple K and C should be tested for best results (Liu et al PLoS One 20127(9)e45161 doi

101371journalpone)

Multiple K Test for Assembly of AGV using OasesVelvet

(Here) read = contig Green + strand Red - strand

bull Annotation of contigs

-search for viral genes using BLASTx or BLASTn

bull BLAST against NCBI database

bull BLAST using your own databases

bull Blast2GO platform

-annotation of contigs

-motif search

-analysis of annotation data

Data Analysis How do we find viral sequences

Data Analysis Analyzing virus-derived contigs

bull Extract BLAST data (sequences with virus as top hit)

bull Organize contigs that hit the same or similar viruses

bull Join contigs into viral genome

bull Design primers for PCRRT-PCR to fill sequence gaps

bull Sequence to confirm in silico cloning result

bull 5rsquo and 3rsquo RACE to identify end sequences

Working with Viral Contigs

viral gene == virus

7815 8536

8874 9193

9476 9737

8523 8730

9495 9307

501 2315

2656 4084

2295 2679

1 244

522 340

6337 5056

5079 4486

4508 4062

6377 6759

6839 7221

7539

7840 7538 7310

24

4

34

0

63

37

6

37

7

67

59

6

83

9

72

21

7

31

0

97

37

75

38

Trinity Assembly of APV2 (gt9800 nt) Assembled using sRNA isolated from pea aphid

7539

APV2-Acyrthosiphon pisum virus 2 (dicistrovrius)

+ strand

- strand

Summary

bull No single rule can be used to find a virus by NGS

bull Knowledge of virology can greatly help for analyzing NGS data

bull Manual alignment of virus derived sequences may be needed

bull Biological evidence is required for verifying true nature of viral sequences discovered by NGS

Acknowledgements

John K VanDyk Lyric Bartholomay Duan Loy

Trimming of Bases with Low QS

Trimming of bases with low quality scores may result in loss of viral sequences

NOTE The near full-length genome of AGV was assembled from an untrimmed data set with poor quality scores The genome could not be assembled from the data set following standard trimming

Software for Checking Sequence Quality-FastQC

Sequence Count Percentage Possible Source

AGATCGGAAGAG

CACACGTCTGAAC

TCCAGTCACCTTG

TAATCTCGTATG

1968861 220

TruSeq Adapter

Index 12 (100

over 49bp)

Overrepresented Sequences

Sequence Count Percentage Possible Source

CAGATTTCGGGCTAAAGGGAATACGGTTAAAATC

CCGTGACCTGCCCTGT 51018488 4090 No Hit

TCAGATTTCGGGCTAAAGGGAATACGGTTAAAATC

CCGTGACCTGCCCTG 24264170 1945 No Hit

The seqeunces were derived from Penaeus vannamei 18S ribosomal RNA -cotaminated in sRNA

Software for manipulating sequencing data

CLC Genomics Workbench (US$5000 per copy gtUS$1000per year for update)

Assembly of Sequencing Reads

bull de novo assembly or mapping (alignment) -de novo assembly searching for new viruses no reference is needed -mapping re-sequencing SNP isolate need reference sequences (MARA GATK and other toolkits) bull de novo assembly may provide extra information about

known viral sequences Shrimp virus Infectious myonecrosis virus (IMNV a dsRNA virus) - documented seq 7560 bp (Poulos et al JGV 2006 87 987-996)

- de novo assembled from RNA-seq 8233 bp RT-PCR proved IMNV should have at least 8233 bp

Thursday 945 am 168 Virus 4 Duan Loy

Trinity for Assembly

OasesVelvet for Assembly

Running the Assembly Program

bull Two most important parameters for assembly ndash K-mers (word length) length of sequence

fragments used for joining

ndash C - coverage cut-off

bull Different combinations of K and C will result in assembly of different contigs

bull Multiple K and C should be tested for best results (Liu et al PLoS One 20127(9)e45161 doi

101371journalpone)

Multiple K Test for Assembly of AGV using OasesVelvet

(Here) read = contig Green + strand Red - strand

bull Annotation of contigs

-search for viral genes using BLASTx or BLASTn

bull BLAST against NCBI database

bull BLAST using your own databases

bull Blast2GO platform

-annotation of contigs

-motif search

-analysis of annotation data

Data Analysis How do we find viral sequences

Data Analysis Analyzing virus-derived contigs

bull Extract BLAST data (sequences with virus as top hit)

bull Organize contigs that hit the same or similar viruses

bull Join contigs into viral genome

bull Design primers for PCRRT-PCR to fill sequence gaps

bull Sequence to confirm in silico cloning result

bull 5rsquo and 3rsquo RACE to identify end sequences

Working with Viral Contigs

viral gene == virus

7815 8536

8874 9193

9476 9737

8523 8730

9495 9307

501 2315

2656 4084

2295 2679

1 244

522 340

6337 5056

5079 4486

4508 4062

6377 6759

6839 7221

7539

7840 7538 7310

24

4

34

0

63

37

6

37

7

67

59

6

83

9

72

21

7

31

0

97

37

75

38

Trinity Assembly of APV2 (gt9800 nt) Assembled using sRNA isolated from pea aphid

7539

APV2-Acyrthosiphon pisum virus 2 (dicistrovrius)

+ strand

- strand

Summary

bull No single rule can be used to find a virus by NGS

bull Knowledge of virology can greatly help for analyzing NGS data

bull Manual alignment of virus derived sequences may be needed

bull Biological evidence is required for verifying true nature of viral sequences discovered by NGS

Acknowledgements

John K VanDyk Lyric Bartholomay Duan Loy

Trimming of bases with low quality scores may result in loss of viral sequences

NOTE The near full-length genome of AGV was assembled from an untrimmed data set with poor quality scores The genome could not be assembled from the data set following standard trimming

Software for Checking Sequence Quality-FastQC

Sequence Count Percentage Possible Source

AGATCGGAAGAG

CACACGTCTGAAC

TCCAGTCACCTTG

TAATCTCGTATG

1968861 220

TruSeq Adapter

Index 12 (100

over 49bp)

Overrepresented Sequences

Sequence Count Percentage Possible Source

CAGATTTCGGGCTAAAGGGAATACGGTTAAAATC

CCGTGACCTGCCCTGT 51018488 4090 No Hit

TCAGATTTCGGGCTAAAGGGAATACGGTTAAAATC

CCGTGACCTGCCCTG 24264170 1945 No Hit

The seqeunces were derived from Penaeus vannamei 18S ribosomal RNA -cotaminated in sRNA

Software for manipulating sequencing data

CLC Genomics Workbench (US$5000 per copy gtUS$1000per year for update)

Assembly of Sequencing Reads

bull de novo assembly or mapping (alignment) -de novo assembly searching for new viruses no reference is needed -mapping re-sequencing SNP isolate need reference sequences (MARA GATK and other toolkits) bull de novo assembly may provide extra information about

known viral sequences Shrimp virus Infectious myonecrosis virus (IMNV a dsRNA virus) - documented seq 7560 bp (Poulos et al JGV 2006 87 987-996)

- de novo assembled from RNA-seq 8233 bp RT-PCR proved IMNV should have at least 8233 bp

Thursday 945 am 168 Virus 4 Duan Loy

Trinity for Assembly

OasesVelvet for Assembly

Running the Assembly Program

bull Two most important parameters for assembly ndash K-mers (word length) length of sequence

fragments used for joining

ndash C - coverage cut-off

bull Different combinations of K and C will result in assembly of different contigs

bull Multiple K and C should be tested for best results (Liu et al PLoS One 20127(9)e45161 doi

101371journalpone)

Multiple K Test for Assembly of AGV using OasesVelvet

(Here) read = contig Green + strand Red - strand

bull Annotation of contigs

-search for viral genes using BLASTx or BLASTn

bull BLAST against NCBI database

bull BLAST using your own databases

bull Blast2GO platform

-annotation of contigs

-motif search

-analysis of annotation data

Data Analysis How do we find viral sequences

Data Analysis Analyzing virus-derived contigs

bull Extract BLAST data (sequences with virus as top hit)

bull Organize contigs that hit the same or similar viruses

bull Join contigs into viral genome

bull Design primers for PCRRT-PCR to fill sequence gaps

bull Sequence to confirm in silico cloning result

bull 5rsquo and 3rsquo RACE to identify end sequences

Working with Viral Contigs

viral gene == virus

7815 8536

8874 9193

9476 9737

8523 8730

9495 9307

501 2315

2656 4084

2295 2679

1 244

522 340

6337 5056

5079 4486

4508 4062

6377 6759

6839 7221

7539

7840 7538 7310

24

4

34

0

63

37

6

37

7

67

59

6

83

9

72

21

7

31

0

97

37

75

38

Trinity Assembly of APV2 (gt9800 nt) Assembled using sRNA isolated from pea aphid

7539

APV2-Acyrthosiphon pisum virus 2 (dicistrovrius)

+ strand

- strand

Summary

bull No single rule can be used to find a virus by NGS

bull Knowledge of virology can greatly help for analyzing NGS data

bull Manual alignment of virus derived sequences may be needed

bull Biological evidence is required for verifying true nature of viral sequences discovered by NGS

Acknowledgements

John K VanDyk Lyric Bartholomay Duan Loy

Software for Checking Sequence Quality-FastQC

Sequence Count Percentage Possible Source

AGATCGGAAGAG

CACACGTCTGAAC

TCCAGTCACCTTG

TAATCTCGTATG

1968861 220

TruSeq Adapter

Index 12 (100

over 49bp)

Overrepresented Sequences

Sequence Count Percentage Possible Source

CAGATTTCGGGCTAAAGGGAATACGGTTAAAATC

CCGTGACCTGCCCTGT 51018488 4090 No Hit

TCAGATTTCGGGCTAAAGGGAATACGGTTAAAATC

CCGTGACCTGCCCTG 24264170 1945 No Hit

The seqeunces were derived from Penaeus vannamei 18S ribosomal RNA -cotaminated in sRNA

Software for manipulating sequencing data

CLC Genomics Workbench (US$5000 per copy gtUS$1000per year for update)

Assembly of Sequencing Reads

bull de novo assembly or mapping (alignment) -de novo assembly searching for new viruses no reference is needed -mapping re-sequencing SNP isolate need reference sequences (MARA GATK and other toolkits) bull de novo assembly may provide extra information about

known viral sequences Shrimp virus Infectious myonecrosis virus (IMNV a dsRNA virus) - documented seq 7560 bp (Poulos et al JGV 2006 87 987-996)

- de novo assembled from RNA-seq 8233 bp RT-PCR proved IMNV should have at least 8233 bp

Thursday 945 am 168 Virus 4 Duan Loy

Trinity for Assembly

OasesVelvet for Assembly

Running the Assembly Program

bull Two most important parameters for assembly ndash K-mers (word length) length of sequence

fragments used for joining

ndash C - coverage cut-off

bull Different combinations of K and C will result in assembly of different contigs

bull Multiple K and C should be tested for best results (Liu et al PLoS One 20127(9)e45161 doi

101371journalpone)

Multiple K Test for Assembly of AGV using OasesVelvet

(Here) read = contig Green + strand Red - strand

bull Annotation of contigs

-search for viral genes using BLASTx or BLASTn

bull BLAST against NCBI database

bull BLAST using your own databases

bull Blast2GO platform

-annotation of contigs

-motif search

-analysis of annotation data

Data Analysis How do we find viral sequences

Data Analysis Analyzing virus-derived contigs

bull Extract BLAST data (sequences with virus as top hit)

bull Organize contigs that hit the same or similar viruses

bull Join contigs into viral genome

bull Design primers for PCRRT-PCR to fill sequence gaps

bull Sequence to confirm in silico cloning result

bull 5rsquo and 3rsquo RACE to identify end sequences

Working with Viral Contigs

viral gene == virus

7815 8536

8874 9193

9476 9737

8523 8730

9495 9307

501 2315

2656 4084

2295 2679

1 244

522 340

6337 5056

5079 4486

4508 4062

6377 6759

6839 7221

7539

7840 7538 7310

24

4

34

0

63

37

6

37

7

67

59

6

83

9

72

21

7

31

0

97

37

75

38

Trinity Assembly of APV2 (gt9800 nt) Assembled using sRNA isolated from pea aphid

7539

APV2-Acyrthosiphon pisum virus 2 (dicistrovrius)

+ strand

- strand

Summary

bull No single rule can be used to find a virus by NGS

bull Knowledge of virology can greatly help for analyzing NGS data

bull Manual alignment of virus derived sequences may be needed

bull Biological evidence is required for verifying true nature of viral sequences discovered by NGS

Acknowledgements

John K VanDyk Lyric Bartholomay Duan Loy

Sequence Count Percentage Possible Source

AGATCGGAAGAG

CACACGTCTGAAC

TCCAGTCACCTTG

TAATCTCGTATG

1968861 220

TruSeq Adapter

Index 12 (100

over 49bp)

Overrepresented Sequences

Sequence Count Percentage Possible Source

CAGATTTCGGGCTAAAGGGAATACGGTTAAAATC

CCGTGACCTGCCCTGT 51018488 4090 No Hit

TCAGATTTCGGGCTAAAGGGAATACGGTTAAAATC

CCGTGACCTGCCCTG 24264170 1945 No Hit

The seqeunces were derived from Penaeus vannamei 18S ribosomal RNA -cotaminated in sRNA

Software for manipulating sequencing data

CLC Genomics Workbench (US$5000 per copy gtUS$1000per year for update)

Assembly of Sequencing Reads

bull de novo assembly or mapping (alignment) -de novo assembly searching for new viruses no reference is needed -mapping re-sequencing SNP isolate need reference sequences (MARA GATK and other toolkits) bull de novo assembly may provide extra information about

known viral sequences Shrimp virus Infectious myonecrosis virus (IMNV a dsRNA virus) - documented seq 7560 bp (Poulos et al JGV 2006 87 987-996)

- de novo assembled from RNA-seq 8233 bp RT-PCR proved IMNV should have at least 8233 bp

Thursday 945 am 168 Virus 4 Duan Loy

Trinity for Assembly

OasesVelvet for Assembly

Running the Assembly Program

bull Two most important parameters for assembly ndash K-mers (word length) length of sequence

fragments used for joining

ndash C - coverage cut-off

bull Different combinations of K and C will result in assembly of different contigs

bull Multiple K and C should be tested for best results (Liu et al PLoS One 20127(9)e45161 doi

101371journalpone)

Multiple K Test for Assembly of AGV using OasesVelvet

(Here) read = contig Green + strand Red - strand

bull Annotation of contigs

-search for viral genes using BLASTx or BLASTn

bull BLAST against NCBI database

bull BLAST using your own databases

bull Blast2GO platform

-annotation of contigs

-motif search

-analysis of annotation data

Data Analysis How do we find viral sequences

Data Analysis Analyzing virus-derived contigs

bull Extract BLAST data (sequences with virus as top hit)

bull Organize contigs that hit the same or similar viruses

bull Join contigs into viral genome

bull Design primers for PCRRT-PCR to fill sequence gaps

bull Sequence to confirm in silico cloning result

bull 5rsquo and 3rsquo RACE to identify end sequences

Working with Viral Contigs

viral gene == virus

7815 8536

8874 9193

9476 9737

8523 8730

9495 9307

501 2315

2656 4084

2295 2679

1 244

522 340

6337 5056

5079 4486

4508 4062

6377 6759

6839 7221

7539

7840 7538 7310

24

4

34

0

63

37

6

37

7

67

59

6

83

9

72

21

7

31

0

97

37

75

38

Trinity Assembly of APV2 (gt9800 nt) Assembled using sRNA isolated from pea aphid

7539

APV2-Acyrthosiphon pisum virus 2 (dicistrovrius)

+ strand

- strand

Summary

bull No single rule can be used to find a virus by NGS

bull Knowledge of virology can greatly help for analyzing NGS data

bull Manual alignment of virus derived sequences may be needed

bull Biological evidence is required for verifying true nature of viral sequences discovered by NGS

Acknowledgements

John K VanDyk Lyric Bartholomay Duan Loy

Software for manipulating sequencing data

CLC Genomics Workbench (US$5000 per copy gtUS$1000per year for update)

Assembly of Sequencing Reads

bull de novo assembly or mapping (alignment) -de novo assembly searching for new viruses no reference is needed -mapping re-sequencing SNP isolate need reference sequences (MARA GATK and other toolkits) bull de novo assembly may provide extra information about

known viral sequences Shrimp virus Infectious myonecrosis virus (IMNV a dsRNA virus) - documented seq 7560 bp (Poulos et al JGV 2006 87 987-996)

- de novo assembled from RNA-seq 8233 bp RT-PCR proved IMNV should have at least 8233 bp

Thursday 945 am 168 Virus 4 Duan Loy

Trinity for Assembly

OasesVelvet for Assembly

Running the Assembly Program

bull Two most important parameters for assembly ndash K-mers (word length) length of sequence

fragments used for joining

ndash C - coverage cut-off

bull Different combinations of K and C will result in assembly of different contigs

bull Multiple K and C should be tested for best results (Liu et al PLoS One 20127(9)e45161 doi

101371journalpone)

Multiple K Test for Assembly of AGV using OasesVelvet

(Here) read = contig Green + strand Red - strand

bull Annotation of contigs

-search for viral genes using BLASTx or BLASTn

bull BLAST against NCBI database

bull BLAST using your own databases

bull Blast2GO platform

-annotation of contigs

-motif search

-analysis of annotation data

Data Analysis How do we find viral sequences

Data Analysis Analyzing virus-derived contigs

bull Extract BLAST data (sequences with virus as top hit)

bull Organize contigs that hit the same or similar viruses

bull Join contigs into viral genome

bull Design primers for PCRRT-PCR to fill sequence gaps

bull Sequence to confirm in silico cloning result

bull 5rsquo and 3rsquo RACE to identify end sequences

Working with Viral Contigs

viral gene == virus

7815 8536

8874 9193

9476 9737

8523 8730

9495 9307

501 2315

2656 4084

2295 2679

1 244

522 340

6337 5056

5079 4486

4508 4062

6377 6759

6839 7221

7539

7840 7538 7310

24

4

34

0

63

37

6

37

7

67

59

6

83

9

72

21

7

31

0

97

37

75

38

Trinity Assembly of APV2 (gt9800 nt) Assembled using sRNA isolated from pea aphid

7539

APV2-Acyrthosiphon pisum virus 2 (dicistrovrius)

+ strand

- strand

Summary

bull No single rule can be used to find a virus by NGS

bull Knowledge of virology can greatly help for analyzing NGS data

bull Manual alignment of virus derived sequences may be needed

bull Biological evidence is required for verifying true nature of viral sequences discovered by NGS

Acknowledgements

John K VanDyk Lyric Bartholomay Duan Loy

CLC Genomics Workbench (US$5000 per copy gtUS$1000per year for update)

Assembly of Sequencing Reads

bull de novo assembly or mapping (alignment) -de novo assembly searching for new viruses no reference is needed -mapping re-sequencing SNP isolate need reference sequences (MARA GATK and other toolkits) bull de novo assembly may provide extra information about

known viral sequences Shrimp virus Infectious myonecrosis virus (IMNV a dsRNA virus) - documented seq 7560 bp (Poulos et al JGV 2006 87 987-996)

- de novo assembled from RNA-seq 8233 bp RT-PCR proved IMNV should have at least 8233 bp

Thursday 945 am 168 Virus 4 Duan Loy

Trinity for Assembly

OasesVelvet for Assembly

Running the Assembly Program

bull Two most important parameters for assembly ndash K-mers (word length) length of sequence

fragments used for joining

ndash C - coverage cut-off

bull Different combinations of K and C will result in assembly of different contigs

bull Multiple K and C should be tested for best results (Liu et al PLoS One 20127(9)e45161 doi

101371journalpone)

Multiple K Test for Assembly of AGV using OasesVelvet

(Here) read = contig Green + strand Red - strand

bull Annotation of contigs

-search for viral genes using BLASTx or BLASTn

bull BLAST against NCBI database

bull BLAST using your own databases

bull Blast2GO platform

-annotation of contigs

-motif search

-analysis of annotation data

Data Analysis How do we find viral sequences

Data Analysis Analyzing virus-derived contigs

bull Extract BLAST data (sequences with virus as top hit)

bull Organize contigs that hit the same or similar viruses

bull Join contigs into viral genome

bull Design primers for PCRRT-PCR to fill sequence gaps

bull Sequence to confirm in silico cloning result

bull 5rsquo and 3rsquo RACE to identify end sequences

Working with Viral Contigs

viral gene == virus

7815 8536

8874 9193

9476 9737

8523 8730

9495 9307

501 2315

2656 4084

2295 2679

1 244

522 340

6337 5056

5079 4486

4508 4062

6377 6759

6839 7221

7539

7840 7538 7310

24

4

34

0

63

37

6

37

7

67

59

6

83

9

72

21

7

31

0

97

37

75

38

Trinity Assembly of APV2 (gt9800 nt) Assembled using sRNA isolated from pea aphid

7539

APV2-Acyrthosiphon pisum virus 2 (dicistrovrius)

+ strand

- strand

Summary

bull No single rule can be used to find a virus by NGS

bull Knowledge of virology can greatly help for analyzing NGS data

bull Manual alignment of virus derived sequences may be needed

bull Biological evidence is required for verifying true nature of viral sequences discovered by NGS

Acknowledgements

John K VanDyk Lyric Bartholomay Duan Loy

Assembly of Sequencing Reads

bull de novo assembly or mapping (alignment) -de novo assembly searching for new viruses no reference is needed -mapping re-sequencing SNP isolate need reference sequences (MARA GATK and other toolkits) bull de novo assembly may provide extra information about

known viral sequences Shrimp virus Infectious myonecrosis virus (IMNV a dsRNA virus) - documented seq 7560 bp (Poulos et al JGV 2006 87 987-996)

- de novo assembled from RNA-seq 8233 bp RT-PCR proved IMNV should have at least 8233 bp

Thursday 945 am 168 Virus 4 Duan Loy

Trinity for Assembly

OasesVelvet for Assembly

Running the Assembly Program

bull Two most important parameters for assembly ndash K-mers (word length) length of sequence

fragments used for joining

ndash C - coverage cut-off

bull Different combinations of K and C will result in assembly of different contigs

bull Multiple K and C should be tested for best results (Liu et al PLoS One 20127(9)e45161 doi

101371journalpone)

Multiple K Test for Assembly of AGV using OasesVelvet

(Here) read = contig Green + strand Red - strand

bull Annotation of contigs

-search for viral genes using BLASTx or BLASTn

bull BLAST against NCBI database

bull BLAST using your own databases

bull Blast2GO platform

-annotation of contigs

-motif search

-analysis of annotation data

Data Analysis How do we find viral sequences

Data Analysis Analyzing virus-derived contigs

bull Extract BLAST data (sequences with virus as top hit)

bull Organize contigs that hit the same or similar viruses

bull Join contigs into viral genome

bull Design primers for PCRRT-PCR to fill sequence gaps

bull Sequence to confirm in silico cloning result

bull 5rsquo and 3rsquo RACE to identify end sequences

Working with Viral Contigs

viral gene == virus

7815 8536

8874 9193

9476 9737

8523 8730

9495 9307

501 2315

2656 4084

2295 2679

1 244

522 340

6337 5056

5079 4486

4508 4062

6377 6759

6839 7221

7539

7840 7538 7310

24

4

34

0

63

37

6

37

7

67

59

6

83

9

72

21

7

31

0

97

37

75

38

Trinity Assembly of APV2 (gt9800 nt) Assembled using sRNA isolated from pea aphid

7539

APV2-Acyrthosiphon pisum virus 2 (dicistrovrius)

+ strand

- strand

Summary

bull No single rule can be used to find a virus by NGS

bull Knowledge of virology can greatly help for analyzing NGS data

bull Manual alignment of virus derived sequences may be needed

bull Biological evidence is required for verifying true nature of viral sequences discovered by NGS

Acknowledgements

John K VanDyk Lyric Bartholomay Duan Loy

Trinity for Assembly

OasesVelvet for Assembly

Running the Assembly Program

bull Two most important parameters for assembly ndash K-mers (word length) length of sequence

fragments used for joining

ndash C - coverage cut-off

bull Different combinations of K and C will result in assembly of different contigs

bull Multiple K and C should be tested for best results (Liu et al PLoS One 20127(9)e45161 doi

101371journalpone)

Multiple K Test for Assembly of AGV using OasesVelvet

(Here) read = contig Green + strand Red - strand

bull Annotation of contigs

-search for viral genes using BLASTx or BLASTn

bull BLAST against NCBI database

bull BLAST using your own databases

bull Blast2GO platform

-annotation of contigs

-motif search

-analysis of annotation data

Data Analysis How do we find viral sequences

Data Analysis Analyzing virus-derived contigs

bull Extract BLAST data (sequences with virus as top hit)

bull Organize contigs that hit the same or similar viruses

bull Join contigs into viral genome

bull Design primers for PCRRT-PCR to fill sequence gaps

bull Sequence to confirm in silico cloning result

bull 5rsquo and 3rsquo RACE to identify end sequences

Working with Viral Contigs

viral gene == virus

7815 8536

8874 9193

9476 9737

8523 8730

9495 9307

501 2315

2656 4084

2295 2679

1 244

522 340

6337 5056

5079 4486

4508 4062

6377 6759

6839 7221

7539

7840 7538 7310

24

4

34

0

63

37

6

37

7

67

59

6

83

9

72

21

7

31

0

97

37

75

38

Trinity Assembly of APV2 (gt9800 nt) Assembled using sRNA isolated from pea aphid

7539

APV2-Acyrthosiphon pisum virus 2 (dicistrovrius)

+ strand

- strand

Summary

bull No single rule can be used to find a virus by NGS

bull Knowledge of virology can greatly help for analyzing NGS data

bull Manual alignment of virus derived sequences may be needed

bull Biological evidence is required for verifying true nature of viral sequences discovered by NGS

Acknowledgements

John K VanDyk Lyric Bartholomay Duan Loy

OasesVelvet for Assembly

Running the Assembly Program

bull Two most important parameters for assembly ndash K-mers (word length) length of sequence

fragments used for joining

ndash C - coverage cut-off

bull Different combinations of K and C will result in assembly of different contigs

bull Multiple K and C should be tested for best results (Liu et al PLoS One 20127(9)e45161 doi

101371journalpone)

Multiple K Test for Assembly of AGV using OasesVelvet

(Here) read = contig Green + strand Red - strand

bull Annotation of contigs

-search for viral genes using BLASTx or BLASTn

bull BLAST against NCBI database

bull BLAST using your own databases

bull Blast2GO platform

-annotation of contigs

-motif search

-analysis of annotation data

Data Analysis How do we find viral sequences

Data Analysis Analyzing virus-derived contigs

bull Extract BLAST data (sequences with virus as top hit)

bull Organize contigs that hit the same or similar viruses

bull Join contigs into viral genome

bull Design primers for PCRRT-PCR to fill sequence gaps

bull Sequence to confirm in silico cloning result

bull 5rsquo and 3rsquo RACE to identify end sequences

Working with Viral Contigs

viral gene == virus

7815 8536

8874 9193

9476 9737

8523 8730

9495 9307

501 2315

2656 4084

2295 2679

1 244

522 340

6337 5056

5079 4486

4508 4062

6377 6759

6839 7221

7539

7840 7538 7310

24

4

34

0

63

37

6

37

7

67

59

6

83

9

72

21

7

31

0

97

37

75

38

Trinity Assembly of APV2 (gt9800 nt) Assembled using sRNA isolated from pea aphid

7539

APV2-Acyrthosiphon pisum virus 2 (dicistrovrius)

+ strand

- strand

Summary

bull No single rule can be used to find a virus by NGS

bull Knowledge of virology can greatly help for analyzing NGS data

bull Manual alignment of virus derived sequences may be needed

bull Biological evidence is required for verifying true nature of viral sequences discovered by NGS

Acknowledgements

John K VanDyk Lyric Bartholomay Duan Loy

Running the Assembly Program

bull Two most important parameters for assembly ndash K-mers (word length) length of sequence

fragments used for joining

ndash C - coverage cut-off

bull Different combinations of K and C will result in assembly of different contigs

bull Multiple K and C should be tested for best results (Liu et al PLoS One 20127(9)e45161 doi

101371journalpone)

Multiple K Test for Assembly of AGV using OasesVelvet

(Here) read = contig Green + strand Red - strand

bull Annotation of contigs

-search for viral genes using BLASTx or BLASTn

bull BLAST against NCBI database

bull BLAST using your own databases

bull Blast2GO platform

-annotation of contigs

-motif search

-analysis of annotation data

Data Analysis How do we find viral sequences

Data Analysis Analyzing virus-derived contigs

bull Extract BLAST data (sequences with virus as top hit)

bull Organize contigs that hit the same or similar viruses

bull Join contigs into viral genome

bull Design primers for PCRRT-PCR to fill sequence gaps

bull Sequence to confirm in silico cloning result

bull 5rsquo and 3rsquo RACE to identify end sequences

Working with Viral Contigs

viral gene == virus

7815 8536

8874 9193

9476 9737

8523 8730

9495 9307

501 2315

2656 4084

2295 2679

1 244

522 340

6337 5056

5079 4486

4508 4062

6377 6759

6839 7221

7539

7840 7538 7310

24

4

34

0

63

37

6

37

7

67

59

6

83

9

72

21

7

31

0

97

37

75

38

Trinity Assembly of APV2 (gt9800 nt) Assembled using sRNA isolated from pea aphid

7539

APV2-Acyrthosiphon pisum virus 2 (dicistrovrius)

+ strand

- strand

Summary

bull No single rule can be used to find a virus by NGS

bull Knowledge of virology can greatly help for analyzing NGS data

bull Manual alignment of virus derived sequences may be needed

bull Biological evidence is required for verifying true nature of viral sequences discovered by NGS

Acknowledgements

John K VanDyk Lyric Bartholomay Duan Loy

Multiple K Test for Assembly of AGV using OasesVelvet

(Here) read = contig Green + strand Red - strand

bull Annotation of contigs

-search for viral genes using BLASTx or BLASTn

bull BLAST against NCBI database

bull BLAST using your own databases

bull Blast2GO platform

-annotation of contigs

-motif search

-analysis of annotation data

Data Analysis How do we find viral sequences

Data Analysis Analyzing virus-derived contigs

bull Extract BLAST data (sequences with virus as top hit)

bull Organize contigs that hit the same or similar viruses

bull Join contigs into viral genome

bull Design primers for PCRRT-PCR to fill sequence gaps

bull Sequence to confirm in silico cloning result

bull 5rsquo and 3rsquo RACE to identify end sequences

Working with Viral Contigs

viral gene == virus

7815 8536

8874 9193

9476 9737

8523 8730

9495 9307

501 2315

2656 4084

2295 2679

1 244

522 340

6337 5056

5079 4486

4508 4062

6377 6759

6839 7221

7539

7840 7538 7310

24

4

34

0

63

37

6

37

7

67

59

6

83

9

72

21

7

31

0

97

37

75

38

Trinity Assembly of APV2 (gt9800 nt) Assembled using sRNA isolated from pea aphid

7539

APV2-Acyrthosiphon pisum virus 2 (dicistrovrius)

+ strand

- strand

Summary

bull No single rule can be used to find a virus by NGS

bull Knowledge of virology can greatly help for analyzing NGS data

bull Manual alignment of virus derived sequences may be needed

bull Biological evidence is required for verifying true nature of viral sequences discovered by NGS

Acknowledgements

John K VanDyk Lyric Bartholomay Duan Loy

bull Annotation of contigs

-search for viral genes using BLASTx or BLASTn

bull BLAST against NCBI database

bull BLAST using your own databases

bull Blast2GO platform

-annotation of contigs

-motif search

-analysis of annotation data

Data Analysis How do we find viral sequences

Data Analysis Analyzing virus-derived contigs

bull Extract BLAST data (sequences with virus as top hit)

bull Organize contigs that hit the same or similar viruses

bull Join contigs into viral genome

bull Design primers for PCRRT-PCR to fill sequence gaps

bull Sequence to confirm in silico cloning result

bull 5rsquo and 3rsquo RACE to identify end sequences

Working with Viral Contigs

viral gene == virus

7815 8536

8874 9193

9476 9737

8523 8730

9495 9307

501 2315

2656 4084

2295 2679

1 244

522 340

6337 5056

5079 4486

4508 4062

6377 6759

6839 7221

7539

7840 7538 7310

24

4

34

0

63

37

6

37

7

67

59

6

83

9

72

21

7

31

0

97

37

75

38

Trinity Assembly of APV2 (gt9800 nt) Assembled using sRNA isolated from pea aphid

7539

APV2-Acyrthosiphon pisum virus 2 (dicistrovrius)

+ strand

- strand

Summary

bull No single rule can be used to find a virus by NGS

bull Knowledge of virology can greatly help for analyzing NGS data

bull Manual alignment of virus derived sequences may be needed

bull Biological evidence is required for verifying true nature of viral sequences discovered by NGS

Acknowledgements

John K VanDyk Lyric Bartholomay Duan Loy

Data Analysis Analyzing virus-derived contigs

bull Extract BLAST data (sequences with virus as top hit)

bull Organize contigs that hit the same or similar viruses

bull Join contigs into viral genome

bull Design primers for PCRRT-PCR to fill sequence gaps

bull Sequence to confirm in silico cloning result

bull 5rsquo and 3rsquo RACE to identify end sequences

Working with Viral Contigs

viral gene == virus

7815 8536

8874 9193

9476 9737

8523 8730

9495 9307

501 2315

2656 4084

2295 2679

1 244

522 340

6337 5056

5079 4486

4508 4062

6377 6759

6839 7221

7539

7840 7538 7310

24

4

34

0

63

37

6

37

7

67

59

6

83

9

72

21

7

31

0

97

37

75

38

Trinity Assembly of APV2 (gt9800 nt) Assembled using sRNA isolated from pea aphid

7539

APV2-Acyrthosiphon pisum virus 2 (dicistrovrius)

+ strand

- strand

Summary

bull No single rule can be used to find a virus by NGS

bull Knowledge of virology can greatly help for analyzing NGS data

bull Manual alignment of virus derived sequences may be needed

bull Biological evidence is required for verifying true nature of viral sequences discovered by NGS

Acknowledgements

John K VanDyk Lyric Bartholomay Duan Loy

Working with Viral Contigs

viral gene == virus

7815 8536

8874 9193

9476 9737

8523 8730

9495 9307

501 2315

2656 4084

2295 2679

1 244

522 340

6337 5056

5079 4486

4508 4062

6377 6759

6839 7221

7539

7840 7538 7310

24

4

34

0

63

37

6

37

7

67

59

6

83

9

72

21

7

31

0

97

37

75

38

Trinity Assembly of APV2 (gt9800 nt) Assembled using sRNA isolated from pea aphid

7539

APV2-Acyrthosiphon pisum virus 2 (dicistrovrius)

+ strand

- strand

Summary

bull No single rule can be used to find a virus by NGS

bull Knowledge of virology can greatly help for analyzing NGS data

bull Manual alignment of virus derived sequences may be needed

bull Biological evidence is required for verifying true nature of viral sequences discovered by NGS

Acknowledgements

John K VanDyk Lyric Bartholomay Duan Loy

7815 8536

8874 9193

9476 9737

8523 8730

9495 9307

501 2315

2656 4084

2295 2679

1 244

522 340

6337 5056

5079 4486

4508 4062

6377 6759

6839 7221

7539

7840 7538 7310

24

4

34

0

63

37

6

37

7

67

59

6

83

9

72

21

7

31

0

97

37

75

38

Trinity Assembly of APV2 (gt9800 nt) Assembled using sRNA isolated from pea aphid

7539

APV2-Acyrthosiphon pisum virus 2 (dicistrovrius)

+ strand

- strand

Summary

bull No single rule can be used to find a virus by NGS

bull Knowledge of virology can greatly help for analyzing NGS data

bull Manual alignment of virus derived sequences may be needed

bull Biological evidence is required for verifying true nature of viral sequences discovered by NGS

Acknowledgements

John K VanDyk Lyric Bartholomay Duan Loy

Summary

bull No single rule can be used to find a virus by NGS

bull Knowledge of virology can greatly help for analyzing NGS data

bull Manual alignment of virus derived sequences may be needed

bull Biological evidence is required for verifying true nature of viral sequences discovered by NGS

Acknowledgements

John K VanDyk Lyric Bartholomay Duan Loy

Acknowledgements

John K VanDyk Lyric Bartholomay Duan Loy