Benchmarking DNA mapping program Stella Hartono Tagkopoulos and Korf Lab

Benchmarking short-read mapping programs

DESCRIPTION

These slides were made by Stella Hartono, in the Korf Lab, UC Davis. For a rotation project in graduate school, she benchmarked the performance of various short-read mapping programs using simulated datasets.

Page 1: Benchmarking short-read mapping programs

Benchmarking DNA mapping program

Stella Hartono

Tagkopoulos and Korf Lab

Page 2: Benchmarking short-read mapping programs

INTRODUCTION

Page 3: Benchmarking short-read mapping programs

WHY DO BENCHMARKING?

• There are 20+ mapping programs: Bowtie, Bowtie2, Eland, BFAST, BWA, GMAP, MAQ, MOSAIK, RMAP, Zoom, SHRiMP, SOAP2, etc.

• Which is the best program?
  • It depends on how you define "best"
  • It depends on data, time, available resources, and expertise

Page 4: Benchmarking short-read mapping programs

WHAT HAPPENS IF YOU DON’T BENCHMARK?

• You choose a program at random
  • Top Google hit
  • “A friend told me”

• You might use the wrong program for your data
  • Results might not be as accurate as reported
  • Mapping might take too much time

Page 5: Benchmarking short-read mapping programs

MY PROJECT

Question: What is the best mapping program for short-read (Illumina) and long-read (PacBio) data?

I used 3 main criteria:
• Accuracy
• Speed
• Memory usage

Page 6: Benchmarking short-read mapping programs

METHODS

Page 7: Benchmarking short-read mapping programs

DATA IS GENERATED USING A READ SIMULATION PROGRAM

• To assess accuracy, I simulated reads tagged with their correct coordinates

• Dwgsim: DNAA Whole Genome Simulator
  • Available on GitHub
  • Outputs reads that mimic various sequencing platforms
    • Illumina
    • PacBio
    • IonTorrent
  • Has a feature that evaluates the results generated by mapping programs

Page 8: Benchmarking short-read mapping programs

READ DATA PARAMETERS

• Genome sequence used: Human Genome (hg19)
  • Chr 1:50-60 Mb (chosen to represent an average region of the human genome)
  • Dwgsim randomly “chops” up the genomic sequence file

• Illumina-like reads
  • 100 bp long
  • 0.5 to 2% base substitution rate (increasing along the read)

• PacBio-like reads
  • 3000 bp long
  • 16% random error rate, made up of 14% indels and 2% base substitutions

• Coverage: 4x and 20x
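As a rough illustration of what these parameters imply, the number of reads needed for a given coverage can be estimated as coverage × region length / read length. The sketch below assumes single-end reads and a 10 Mb region (Chr 1:50-60 Mb); the function and constant names are made up for this example, and it does not reproduce how Dwgsim itself was configured.

```python
# Back-of-the-envelope read counts for the simulated datasets.
# Assumptions (not stated in the slides): single-end reads, uniform coverage.

REGION_BP = 10_000_000  # Chr 1:50-60 Mb


def reads_needed(region_bp: int, read_len_bp: int, coverage: int) -> int:
    """Smallest read count such that reads * read_len >= coverage * region."""
    total_bases = coverage * region_bp
    return (total_bases + read_len_bp - 1) // read_len_bp  # ceiling division


read_counts = {
    (platform, coverage): reads_needed(REGION_BP, read_len, coverage)
    for platform, read_len in (("Illumina-like", 100), ("PacBio-like", 3000))
    for coverage in (4, 20)
}

for (platform, coverage), n in read_counts.items():
    print(f"{platform} at {coverage}x: {n} reads")
```

Under these assumptions, going from 4x to 20x multiplies the number of reads by five for the Illumina-like data.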

Page 9: Benchmarking short-read mapping programs

MAPPING PROGRAMS

• There are 20+ mapping programs available
• Ideally, I should try all of them, but within a 1-month rotation, I was only able to try 8:
  1. BWA
  2. Bowtie2
  3. MAQ
  4. Soap2
  5. Rmap: output format can’t be evaluated by Dwgsim
  6. SHRiMP: output format can’t be evaluated by Dwgsim
  7. SSAHA2: very slow (10-20x slower)
  8. Novoalign: very slow (10-20x slower)


Page 10: Benchmarking short-read mapping programs

RESULTS

Page 11: Benchmarking short-read mapping programs

ILLUMINA-LIKE READ: ACCURACY

• Accuracy = (Reads Mapped Correctly / Total Reads) × 100%
• BWA and Bowtie2 have the best accuracy
• Soap2 has the lowest accuracy
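The accuracy definition above can be sketched in a few lines. This is only an illustration, not Dwgsim's own evaluation feature: the `Read` record, its field names, and the `tolerance` parameter are assumptions made for the example. Unmapped reads count as incorrect because the denominator is the total number of reads.

```python
from typing import NamedTuple, Optional, Sequence


class Read(NamedTuple):
    """Hypothetical record: simulated origin plus the mapper's answer."""
    true_chrom: str
    true_pos: int
    mapped_chrom: Optional[str]  # None if the read was not mapped
    mapped_pos: Optional[int]


def is_correct(read: Read, tolerance: int = 0) -> bool:
    """A read is correct if it maps to its true chromosome within `tolerance` bp."""
    if read.mapped_chrom is None or read.mapped_pos is None:
        return False
    return (read.mapped_chrom == read.true_chrom
            and abs(read.mapped_pos - read.true_pos) <= tolerance)


def mapping_accuracy(reads: Sequence[Read], tolerance: int = 0) -> float:
    """Accuracy = (reads mapped correctly / total reads) * 100."""
    if not reads:
        raise ValueError("no reads to evaluate")
    correct = sum(is_correct(r, tolerance) for r in reads)
    return 100.0 * correct / len(reads)


# Toy input: two correct, one mapped to the wrong place, one unmapped.
example_reads = [
    Read("chr1", 1000, "chr1", 1000),
    Read("chr1", 2000, "chr1", 2000),
    Read("chr1", 3000, "chr1", 9000),
    Read("chr1", 4000, None, None),
]
example_accuracy = mapping_accuracy(example_reads)
```

A non-zero `tolerance` would be one way to accept alignments whose reported start shifts by a few bases because of indels or clipping.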

Page 12: Benchmarking short-read mapping programs

ILLUMINA-LIKE READ: SPEED

• Bowtie2 is the slowest
• Within each program, run time increases linearly with coverage (20x takes about 5 times as long as 4x)

Page 13: Benchmarking short-read mapping programs

ILLUMINA-LIKE READ: MEMORY

• Bowtie2 uses the least memory
• All programs except MAQ use about the same amount of memory at both coverages
• Memory used by MAQ increased ~4 times at 20x coverage
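The slides do not say how run time and memory were recorded. One possible way to collect both on a Unix-like system is sketched below, using Python's standard `subprocess`, `time`, and `resource` modules; the helper name `run_and_measure` is made up for this example, and the command passed to it would be whatever mapper invocation is being benchmarked.

```python
import resource
import subprocess
import sys
import time


def run_and_measure(cmd):
    """Run `cmd` and return (wall_seconds, peak_rss_of_children).

    Caveats: ru_maxrss is reported in kilobytes on Linux and in bytes on
    macOS, and RUSAGE_CHILDREN is a high-water mark over all child
    processes that have finished so far, so each mapper is best measured
    in a fresh Python process.
    """
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    elapsed = time.perf_counter() - start
    peak_rss = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return elapsed, peak_rss


# Stand-in command so the sketch runs anywhere; replace it with a real
# mapper command line when benchmarking.
elapsed, peak_rss = run_and_measure([sys.executable, "-c", "pass"])
print(f"wall time: {elapsed:.3f} s, peak RSS: {peak_rss}")
```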

Page 14: Benchmarking short-read mapping programs

ILLUMINA-LIKE READ: OVERALL

Ranking table (lower is better)

Program   Accuracy      Speed   Memory
BWA       1 (94-95%)    1       2 (150 MB)
Bowtie2   1 (94-95%)    4       1 (80 MB)
MAQ       3 (90-91%)    1       4 (300-1200 MB)
Soap2     4 (71-82%)    1       3 (650 MB)

• BWA is accurate, fast, and quite memory efficient
• Bowtie2 is accurate and memory efficient, but slow
• MAQ is fairly accurate and fast, but uses a lot of memory
• Soap2 is fast, but not very accurate, and uses a lot of memory
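One simple way to turn the three per-criterion ranks into a single overall ordering is to add them up. The slides do not state that the ranks were combined this way, so the equal-weight sum below is an assumption made for illustration; the rank values themselves are copied from the table.

```python
# Ranks copied from the ranking table (lower is better).
ranks = {
    "BWA":     {"accuracy": 1, "speed": 1, "memory": 2},
    "Bowtie2": {"accuracy": 1, "speed": 4, "memory": 1},
    "MAQ":     {"accuracy": 3, "speed": 1, "memory": 4},
    "Soap2":   {"accuracy": 4, "speed": 1, "memory": 3},
}

# Assumption: all three criteria are weighted equally.
totals = {program: sum(scores.values()) for program, scores in ranks.items()}

# Sort by total rank; sorted() is stable, so ties keep table order.
ordered = sorted(totals.items(), key=lambda item: item[1])

for program, total in ordered:
    print(f"{program}: {total}")
```

With equal weights BWA has the lowest total, which matches the conclusion drawn in the slides; a user who cares mostly about memory, for example, could weight the criteria differently and get another ordering.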

Page 15: Benchmarking short-read mapping programs

PACBIO-LIKE READ

All but BWA failed to map anything

The newest version of BWA has a mode specifically for PacBio reads

Page 16: Benchmarking short-read mapping programs

CONCLUSION

• Benchmarked 4 mapping programs (BWA, Bowtie2, MAQ, Soap2)

• Criteria: accuracy, speed, and memory

• Illumina-like reads (100 bp, 0.5-2% substitution rate)
  • BWA is the best for Illumina-like data

• PacBio-like reads (3000 bp, 14% indels, 2% substitution)
  • All but BWA failed
  • BWA is the best for PacBio-like reads
  • High accuracy (~90%)

Page 17: Benchmarking short-read mapping programs

CONCLUSION

• It takes a lot of effort to benchmark programs, but the results are useful

• From this rotation, I learned that BWA seems to be the best for mapping both short-read and long-read data

• Future directions:
  • Different data types (Nanopore 60 kb reads?)
  • Benchmark more programs
  • Fine-tune parameters for each program

Page 18: Benchmarking short-read mapping programs

ACKNOWLEDGEMENTS

• UC Davis GGG for funding
• My overlords in the Tagkopoulos and Korf Labs:
  • Ilias Tagkopoulos
  • Ian Korf
• Everyone else in the lab!
  • Vadim, Jiyeon, Eren, Linh, Ted, Keith B, Keith D, Natalie, Ken, Paul, Abby, Yen, Kristen, Matt, Daniel, Danielle,