Benchmarking short-read mapping programs

Benchmarking DNA mapping programs

Stella Hartono
Tagkopoulos and Korf Lab

INTRODUCTION

WHY DO BENCHMARKING?

• There are 20+ mapping programs: Bowtie, Bowtie2, Eland, BFAST, BWA, GMAP, MAQ, MOSAIK, RMAP, Zoom, SHRiMP, SOAP2, etc.

• Which is the best program?
• It depends on how you define "best"
• It depends on data, time, available resources, and expertise

WHAT HAPPENS IF YOU DON’T BENCHMARK?

• You choose a program at random: top Google hit, “a friend told me”

• You might use the wrong program for your data
• Results might not be as accurate as reported
• Mapping might take too much time

MY PROJECT

Question: What is the best mapping program for short read (Illumina) and long read (PacBio) data?

I used 3 main criteria:
• Accuracy
• Speed
• Memory usage

METHODS

DATA IS GENERATED USING A READ SIMULATION PROGRAM

• To assess accuracy, I simulated reads tagged with their correct coordinates

• Dwgsim: DNAA Whole Genome Simulator
  • Available on GitHub

• Outputs reads that mimic various sequencing platforms
  • Illumina
  • PacBio
  • IonTorrent

• Has a feature that evaluates the results generated by mapping programs
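The idea of reads that carry their own answer key can be illustrated with a minimal Python sketch. This is not Dwgsim's implementation or its read-name format: the functions `simulate_reads` and `true_origin` and the `<contig>_<start>_<end>` naming scheme are hypothetical, chosen only to show how the true coordinate can travel with each read.

```python
import random


def simulate_reads(reference, read_length, n_reads, contig="chr1", seed=0):
    """Draw error-free reads at random positions (hypothetical naming scheme).

    Each read is named '<contig>_<start>_<end>' (0-based start, exclusive end)
    so that its true origin can be recovered after mapping.
    """
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        start = rng.randint(0, len(reference) - read_length)
        end = start + read_length
        reads.append((f"{contig}_{start}_{end}", reference[start:end]))
    return reads


def true_origin(read_name):
    """Recover (contig, start, end) from a name produced by simulate_reads."""
    contig, start, end = read_name.rsplit("_", 2)
    return contig, int(start), int(end)


if __name__ == "__main__":
    rng = random.Random(42)
    toy_reference = "".join(rng.choice("ACGT") for _ in range(1000))
    for name, seq in simulate_reads(toy_reference, read_length=100, n_reads=3):
        contig, start, end = true_origin(name)
        # The read sequence must match the reference at the tagged coordinate.
        assert toy_reference[start:end] == seq
        print(name, seq[:20] + "...")
```

Because the truth is stored in the read name, any mapper's output can be scored later without keeping a separate lookup table.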

READ DATA PARAMETERS

• Genome sequence used: Human Genome (hg19)
  • Chr 1: 50-60 Mb (represents the average human genome)
  • Dwgsim randomly “chops” up the genomic sequence file

• Illumina-like reads
  • 100 bp long
  • 0.5 to 2% base substitution (increasing along the read)

• PacBio-like reads
  • 3000 bp long
  • 16% random error rate, represented by 14% indels and 2% base substitution

• Coverage: 4x and 20x
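As a rough guide to the data volumes these parameters imply, the sketch below computes how many reads are needed to cover the 10 Mb region at each coverage. It assumes single-end reads and the usual relation reads = coverage × region length / read length. The linear ramp in `substitution_rate` is also an assumption: the slides only say the rate increases along the read. Both function names are hypothetical.

```python
import math

REGION_LENGTH = 10_000_000  # Chr 1: 50-60 Mb


def reads_needed(coverage, read_length, region_length=REGION_LENGTH):
    """Smallest read count with reads * read_length >= coverage * region_length."""
    return math.ceil(coverage * region_length / read_length)


def substitution_rate(position, read_length, start=0.005, end=0.02):
    """Assumed linear ramp from `start` at the first base to `end` at the last."""
    if read_length == 1:
        return start
    return start + (end - start) * position / (read_length - 1)


if __name__ == "__main__":
    for platform, read_length in (("Illumina-like", 100), ("PacBio-like", 3000)):
        for coverage in (4, 20):
            print(platform, f"{coverage}x", reads_needed(coverage, read_length))
    # Illumina-like 4x 400000
    # Illumina-like 20x 2000000
    # PacBio-like 4x 13334
    # PacBio-like 20x 66667
```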

MAPPING PROGRAMS

• There are 20+ mapping programs available
• Ideally, I should try all of them, but within a 1-month rotation I was only able to try 8:

1. BWA
2. Bowtie2
3. MAQ
4. SOAP2
5. RMAP - output format can’t be evaluated by Dwgsim
6. SHRiMP - output format can’t be evaluated by Dwgsim
7. SSAHA2 - very slow (10-20x slower)
8. Novoalign - very slow (10-20x slower)


RESULTS

ILLUMINA-LIKE READ: ACCURACY

Accuracy = (Reads Mapped Correctly / Total Reads) × 100%

• BWA and Bowtie2 have the best accuracy
• SOAP2 has the lowest accuracy
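The accuracy definition above can be written as a short Python function. This is a minimal sketch, not Dwgsim's evaluation code: the dictionary layout, the function name `mapping_accuracy`, and the `tolerance` parameter are assumptions made for illustration. Unmapped reads count against accuracy because the denominator is the total number of simulated reads.

```python
def mapping_accuracy(true_positions, mapped_positions, tolerance=0):
    """Percent of all simulated reads mapped to their true coordinate.

    true_positions:   {read_name: (contig, start)} from the simulator
    mapped_positions: {read_name: (contig, start)} from the mapper;
                      unmapped reads are simply absent
    tolerance:        allowed offset in bp (an assumption, not from the slides)
    """
    if not true_positions:
        raise ValueError("no simulated reads")
    correct = 0
    for name, (contig, start) in true_positions.items():
        hit = mapped_positions.get(name)
        if hit is not None and hit[0] == contig and abs(hit[1] - start) <= tolerance:
            correct += 1
    return 100.0 * correct / len(true_positions)


if __name__ == "__main__":
    truth = {
        "r1": ("chr1", 100),
        "r2": ("chr1", 200),
        "r3": ("chr1", 300),
        "r4": ("chr1", 400),
    }
    # r3 is mapped to the wrong place and r4 is unmapped.
    mapped = {"r1": ("chr1", 100), "r2": ("chr1", 200), "r3": ("chr1", 999)}
    print(mapping_accuracy(truth, mapped))  # 50.0
```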

ILLUMINA-LIKE READ: SPEED

• Bowtie2 is the slowest
• For each program, run time increases linearly with coverage (20x takes about 5 times as long as 4x)

ILLUMINA-LIKE READ: MEMORY

• Bowtie2 uses the least memory
• All programs except MAQ use about the same memory at both coverages
• Memory used by MAQ increased ~4 times at 20x coverage
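Run time and peak memory for a mapping run can be collected with a small wrapper like the one sketched here. The slides do not say how the measurements were taken, so this is only one possible approach, using Python's standard `subprocess` and `resource` modules on a Unix-like system. The function name `run_and_measure` is hypothetical, and the command in the example is a placeholder rather than a real mapper invocation.

```python
import resource
import subprocess
import sys
import time


def run_and_measure(command):
    """Run `command` (a list of arguments) and return (seconds, peak_rss).

    peak_rss is ru_maxrss for child processes: kilobytes on Linux, bytes on
    macOS. It is the maximum over ALL children waited for so far, so measure
    one command per Python process to avoid carry-over from earlier runs.
    """
    start = time.perf_counter()
    subprocess.run(command, check=True, stdout=subprocess.DEVNULL)
    elapsed = time.perf_counter() - start
    peak_rss = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return elapsed, peak_rss


if __name__ == "__main__":
    # Placeholder command; substitute the real mapper command line here.
    seconds, peak = run_and_measure([sys.executable, "-c", "print('hello')"])
    print(f"{seconds:.2f} s, peak RSS {peak} (KB on Linux)")
```

Running the same wrapper on the 4x and 20x read sets gives the pair of timings needed to check the roughly 5-fold scaling noted for speed.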

ILLUMINA-LIKE READ: OVERALL

Ranking table (lower is better)

Program   Accuracy      Speed   Memory
BWA       1 (94-95%)    1       2 (150 MB)
Bowtie2   1 (94-95%)    4       1 (80 MB)
MAQ       3 (90-91%)    1       4 (300-1200 MB)
SOAP2     4 (71-82%)    1       3 (650 MB)

• BWA is accurate, fast, and quite memory efficient
• Bowtie2 is accurate and memory efficient, but slow
• MAQ is fairly accurate and fast, but uses lots of memory
• SOAP2 is fast, but not very accurate, and uses lots of memory
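The accuracy and memory ranks in the table are consistent with standard competition ranking, where tied programs share the best rank and the next rank is skipped. The sketch below reproduces those two columns under one assumption: each program is represented by the upper end of its reported range (for example 1200 MB for MAQ). The slides do not state how ties were decided, and the speed column is not reproduced because no timings are given in the text. The function name `competition_rank` is hypothetical.

```python
def competition_rank(values, higher_is_better=False):
    """Rank items from 1; ties share the best rank and the next rank is skipped."""
    ordered = sorted(values.values(), reverse=higher_is_better)
    # list.index returns the first position of a value, so ties share a rank.
    return {name: ordered.index(value) + 1 for name, value in values.items()}


# Upper end of each range reported in the table (an assumption for this sketch).
accuracy_pct = {"BWA": 95, "Bowtie2": 95, "MAQ": 91, "SOAP2": 82}
peak_memory_mb = {"BWA": 150, "Bowtie2": 80, "MAQ": 1200, "SOAP2": 650}

if __name__ == "__main__":
    print(competition_rank(accuracy_pct, higher_is_better=True))
    # {'BWA': 1, 'Bowtie2': 1, 'MAQ': 3, 'SOAP2': 4}
    print(competition_rank(peak_memory_mb))
    # {'BWA': 2, 'Bowtie2': 1, 'MAQ': 4, 'SOAP2': 3}
```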

PACBIO-LIKE READ

• All programs except BWA failed to map any reads

• The newest BWA has a function specific to PacBio reads

CONCLUSION

• Benchmarked 4 mapping programs (BWA, Bowtie2, MAQ, SOAP2)
• Criteria: accuracy, speed, and memory usage

• Illumina-like reads (100 bp, 0.5-2% substitution rate)
  • BWA is the best for Illumina-like data

• PacBio-like reads (3000 bp, 14% indels, 2% substitution)
  • All but BWA failed
  • BWA is the best for PacBio-like reads
  • High accuracy (~90%)

CONCLUSION

It takes a lot of effort to benchmark programs, but the results are useful

From this rotation, I learned that BWA seems to be the best for mapping both short and long read data

Future directions:
• Different data types (Nanopore 60 kb reads?)
• Benchmark more programs
• Fine-tune parameters for each program

ACKNOWLEDGEMENTS

• UC Davis GGG for funding
• My overlords in Tagkopoulos and Korf Lab: Ilias Tagkopoulos, Ian Korf
• Everyone else in the lab!
• Vadim, Jiyeon, Eren, Linh, Ted, Keith B, Keith D, Natalie, Ken, Paul, Abby, Yen, Kristen, Matt, Daniel, Danielle,
