These slides were made by Stella Hartono, in the Korf Lab, UC Davis. For a rotation project in graduate school, she benchmarked the performance of various short-read mapping programs using simulated datasets.
Benchmarking DNA mapping programs
Stella Hartono
Tagkopoulos and Korf Lab
INTRODUCTION
WHY DO BENCHMARKING?
There are 20+ mapping programs
• Bowtie, Bowtie2, Eland, BFAST, BWA, GMAP, MAQ, MOSAIK, RMAP, Zoom, SHRiMP, SOAP2, etc.
Which is the best program?
• It depends on how you define "best"
• It depends on data, time, available resources, and expertise
WHAT HAPPENS IF YOU DON'T BENCHMARK?
You choose a program at random
• Top Google hit
• "A friend told me"
You might use the wrong program for your data
• Results might not be as accurate as reported
• Mapping might take too much time
MY PROJECT
Question: What is the best mapping program for short-read (Illumina) and long-read (PacBio) data?
I used 3 main criteria:
• Accuracy
• Speed
• Memory usage
METHODS
DATA ARE GENERATED USING A READ SIMULATION PROGRAM
• To assess accuracy, I simulated reads tagged with their correct coordinates
• Dwgsim: DNAA Whole Genome Simulator
  • Available on GitHub
• Outputs reads that mimic various sequencing platforms
  • Illumina
  • PacBio
  • IonTorrent
• Has a feature that evaluates the results generated by mapping programs
READ DATA PARAMETERS
• Genome sequence used: Human Genome (hg19)
  • Chr 1:50-60 Mb (representative of the average human genome)
  • Dwgsim randomly "chops" up the genomic sequence file
• Illumina-like reads
  • 100 bp long
  • 0.5 to 2% base substitution (increasing along the read)
• PacBio-like reads
  • 3000 bp long
  • 16% random error rate, made up of 14% indels and 2% base substitution
• Coverage: 4x and 20x
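The coverage settings above imply a rough number of reads per dataset. The sketch below applies the usual relation coverage = (number of reads * read length) / region length to the 10 Mb region. The function name `reads_needed` is hypothetical, and the calculation assumes single-end reads of fixed length; the slides do not state the exact read counts that Dwgsim produced.

```python
import math

# Region simulated in the slides: chr1:50-60 Mb, i.e. 10 Mb.
REGION_BP = 10_000_000


def reads_needed(region_bp: int, read_len_bp: int, coverage: float) -> int:
    """Reads required so that reads * read_len / region >= coverage.

    Hypothetical helper: assumes fixed-length, single-end reads.
    """
    return math.ceil(coverage * region_bp / read_len_bp)


if __name__ == "__main__":
    for platform, read_len in [("Illumina-like", 100), ("PacBio-like", 3000)]:
        for cov in (4, 20):
            n = reads_needed(REGION_BP, read_len, cov)
            print(f"{platform} {cov}x: {n} reads")
```

Under these assumptions, the Illumina-like datasets contain a few hundred thousand to a few million reads, while the PacBio-like datasets contain far fewer, much longer reads.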
MAPPING PROGRAMS
• There are 20+ mapping programs available
• Ideally, I should try all of them, but within a 1-month rotation, I was only able to try 8:
1. BWA
2. Bowtie2
3. MAQ
4. Soap2
5. Rmap - output format can't be evaluated by Dwgsim
6. SHRiMP - output format can't be evaluated by Dwgsim
7. SSAHA2 - very slow (10-20x slower)
8. Novoalign - very slow (10-20x slower)
RESULTS
ILLUMINA-LIKE READ: ACCURACY
Accuracy = (Reads Mapped Correctly / Total Reads) * 100%
• BWA and Bowtie2 have the best accuracy
• Soap2 has the lowest accuracy
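The accuracy formula can be sketched in a few lines of Python. This is only an illustration, not the evaluation that Dwgsim performs: the `Read` record, the `is_correct` rule (same chromosome and a start position within a `tolerance` in bp), and the example reads are all assumptions made for this sketch. Unmapped reads count as incorrect because the denominator is the total number of reads.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Read:
    """Hypothetical record pairing a simulated read's true origin with its mapping."""
    true_chrom: str
    true_pos: int
    mapped_chrom: Optional[str]  # None if the read was not mapped
    mapped_pos: Optional[int]    # None if the read was not mapped


def is_correct(read: Read, tolerance: int = 0) -> bool:
    """Assumed rule: same chromosome and start within `tolerance` bp of the truth."""
    if read.mapped_chrom is None or read.mapped_pos is None:
        return False
    same_chrom = read.mapped_chrom == read.true_chrom
    return same_chrom and abs(read.mapped_pos - read.true_pos) <= tolerance


def accuracy_percent(reads: Iterable[Read], tolerance: int = 0) -> float:
    """Accuracy = (reads mapped correctly / total reads) * 100."""
    reads = list(reads)
    if not reads:
        raise ValueError("no reads to evaluate")
    correct = sum(is_correct(r, tolerance) for r in reads)
    return 100.0 * correct / len(reads)


if __name__ == "__main__":
    example = [
        Read("chr1", 1000, "chr1", 1000),  # correct
        Read("chr1", 2000, "chr1", 2003),  # correct only with tolerance >= 3
        Read("chr1", 3000, "chr1", 9000),  # mapped to the wrong place
        Read("chr1", 4000, None, None),    # unmapped
    ]
    print(accuracy_percent(example))               # strict matching
    print(accuracy_percent(example, tolerance=5))  # allow small offsets
```

A small tolerance is often useful because indels or soft clipping near the read start can shift the reported position by a few bases.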
ILLUMINA-LIKE READ: SPEED
• Bowtie2 is the slowest
• Run time for each program scales linearly with coverage (20x takes about 5 times as long as 4x)
ILLUMINA-LIKE READ: MEMORY
• Bowtie2 uses the least memory
• All programs except MAQ use about the same memory at both coverages
• Memory used by MAQ increased ~4 times at 20x coverage
ILLUMINA-LIKE READ: OVERALL
Ranking Table (lower is better)
Program   Accuracy      Speed   Memory
BWA       1 (94-95%)    1       2 (150 MB)
Bowtie2   1 (94-95%)    4       1 (80 MB)
MAQ       3 (90-91%)    1       4 (300-1200 MB)
Soap2     4 (71-82%)    1       3 (650 MB)
• BWA is accurate, fast, and quite memory efficient
• Bowtie2 is accurate and memory efficient, but slow
• MAQ is fairly accurate and fast, but uses a lot of memory
• Soap2 is fast, but not very accurate, and uses a lot of memory
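One simple way to turn the ranking table into a single ordering is to add up the three ranks for each program. The slides do not say the ranks were combined this way; the equal weighting and the function name `rank_sums` are assumptions of this sketch. The rank values are copied from the table.

```python
# Ranks copied from the ranking table (lower is better).
RANKS = {
    "BWA":     {"accuracy": 1, "speed": 1, "memory": 2},
    "Bowtie2": {"accuracy": 1, "speed": 4, "memory": 1},
    "MAQ":     {"accuracy": 3, "speed": 1, "memory": 4},
    "Soap2":   {"accuracy": 4, "speed": 1, "memory": 3},
}


def rank_sums(ranks: dict) -> dict:
    """Sum the per-criterion ranks of each program (equal weights assumed)."""
    return {program: sum(per_criterion.values())
            for program, per_criterion in ranks.items()}


if __name__ == "__main__":
    totals = rank_sums(RANKS)
    # Sort by total rank; a lower total means a better overall result.
    for program, total in sorted(totals.items(), key=lambda item: item[1]):
        print(f"{program}: {total}")
```

With equal weights, BWA has the lowest total (4), followed by Bowtie2 (6), with MAQ and Soap2 tied (8). A project that cares mostly about memory or speed could weight the criteria differently and reach a different ordering.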
PACBIO-LIKE READ
• All programs except BWA failed to map any reads
• The newest BWA has a function specific to PacBio reads
CONCLUSION
Benchmarked 4 mapping programs (BWA, Bowtie2, MAQ, Soap2)
Criteria: accuracy, speed, and memory
Illumina-like reads (100 bp, 0.5-2% substitution rate)
• BWA is the best for Illumina-like data
PacBio-like reads (3000 bp, 14% indels, 2% substitution)
• All but BWA failed
• BWA is the best for PacBio-like reads
• High accuracy (~90%)
CONCLUSION
It takes a lot of effort to benchmark programs, but the results are useful
From this rotation, I learned that BWA seems to be the best for mapping both short- and long-read data
Future Directions:
• Different data types (Nanopore 60 kb reads?)
• Benchmark more programs
• Fine-tune parameters for each program
ACKNOWLEDGEMENTS
• UC Davis GGG for funding
• My overlords in the Tagkopoulos and Korf Labs: Ilias Tagkopoulos, Ian Korf
• Everyone else in the lab!
  • Vadim, Jiyeon, Eren, Linh, Ted, Keith B, Keith D, Natalie, Ken, Paul, Abby, Yen, Kristen, Matt, Daniel, Danielle,