Douglas Cork, Steven Lembark
HIV-1, W-curves, & Shoe Leather
● Existing genetics tools fail on HIV-1.
● They make assumptions based on “normal” DNA that fail on HIV, or cancer, or plants.
● Correlation tools look at evolution, not state.
● We are working on tools for clinical analysis.
● The W-curve abstracts DNA into geometry.
● The TSP clusters genes rather than trying to impute inheritance.
Sequences Inform Treatment
● Treating HIV requires sequencing it to choose appropriate drugs:
  ● HIV-1 evolves drug resistance in months.
  ● Multiple strains in a single patient are common, either from multiple sources or from evolution.
  ● Crossover recombination is relatively common due to cross-infected cells.
Problem: HIV is Hard to Analyze
● HIV is a non-correcting retrovirus.
● Evolves 10,000 times faster than humans or influenza: one new strain per patient per day.
● Genomes for wild types range from 8349 to 9829 bases, making localized comparisons difficult.
● The single FDA-approved algorithm directing treatment from sequence handles only type B; the U.S. Army has 15%+ non-B infections.
The Current Tools
● BLAST, FASTA, and ClustalW perform alignment.
● Table-driven analysis of base transitions.
● Score the entire sequence with a single value.
● Graphical tools are designed to display inheritance rather than state.
● Output is difficult to read in a clinical setting.
Phenogram of Drug-Resistant and Random Samples
● Tries to show ancestry, not state.
● Not very good for visual identification of which patients are drug resistant.
Trees are not particularly helpful either.
ClustalW of gp120
● Difficult to compare sequences visually.
● Not useful for large numbers of sequences.
● Gaps make analysis difficult.
[Figure: ClustalW alignment of gp120, HIVHXB2CG vs. AY736838-gp120_. The first 60-base block is shown; the remaining blocks follow the same layout, with "-" marking gaps.]
HIVHXB2CG        TGATCTGTAGTGCTACAGAAAAATTGTGGGTCACAGTCTATTATGGGGTACCTGTGTGGA
AY736838-gp120_  -------------------------------TACAGTTTATTATGGGGTGCCTGTGTGGA
New Tools
● Clinical vs. evolutionary.
● Avoid assumptions that break current tools.
● Suitable for a repeatable process in clinics or data mining in research.
● We are using:
  ● W-curve for analysis.
  ● TSP for clustering.
  ● R for data management & display.
W-curve
● Geometric abstraction of DNA.
● Manufactured by a simple state machine.
● Geometry allows alignment at a finer scale than character strings.
● Avoids assumptions about transition probabilities by taking the figure as-is.
W-curve Generator is a State Machine
● C, A, T, G are assigned to corners of a square.
● Successive points move halfway to the next base's corner.
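The state machine described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' generator: the particular corner assigned to each base, the starting point at the origin, and the names `CORNERS` and `wcurve` are all assumptions made for the example.

```python
# Assumed corner layout: the slides only say C, A, T, G sit on the corners
# of a square, not which corner each base gets.
CORNERS = {
    "A": (-1.0, -1.0),
    "C": (-1.0, 1.0),
    "G": (1.0, 1.0),
    "T": (1.0, -1.0),
}


def wcurve(sequence, corners=CORNERS):
    """Return one (x, y, z) point per base.

    Each point lies halfway between the previous point and the corner of
    the current base; z is the 1-based position in the sequence.
    Starting at the origin is an assumption of this sketch.
    """
    x, y = 0.0, 0.0
    points = []
    for z, base in enumerate(sequence.upper(), start=1):
        cx, cy = corners[base]
        x = (x + cx) / 2.0
        y = (y + cy) / 2.0
        points.append((x, y, z))
    return points


if __name__ == "__main__":
    for point in wcurve("CG"):
        print(point)
```

Because each step halves the distance to a corner, the influence of any one base on later points decays geometrically, so a single substitution perturbs the curve locally and does not shift everything after it.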
W-curve for “CG”
● Curve shown in blue.
● Halfway to C, then G, in X-Y; single steps in Z.
● Cylindrical storage simplifies comparison.
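As a rough illustration of cylindrical storage, the sketch below converts Cartesian curve points to (r, theta, z). The function name `to_cylindrical` and the sample points (which assume C at the upper-left corner and G at the upper-right) are hypothetical; the slides do not say how the authors store or compare the values.

```python
import math


def to_cylindrical(points):
    """Convert (x, y, z) points to (r, theta, z), theta in radians."""
    return [(math.hypot(x, y), math.atan2(y, x), z) for x, y, z in points]


if __name__ == "__main__":
    # Points for "CG" under the assumed corner layout described above.
    cg_points = [(-0.5, 0.5, 1), (0.25, 0.75, 2)]
    for r, theta, z in to_cylindrical(cg_points):
        print(f"z={z} r={r:.4f} theta={theta:.4f}")
```

One plausible reading of the bullet is that, with z fixed as the base index, comparing two curves at a given position reduces to comparing a radius and an angle.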
W-curve of Wild HIV-1 POL Gene
W-curves of Wild & Drug Resistant Pol
Detail of Wild & Drug Resistant Pol
Distance Metric
● Bases are arranged in a square to minimize the effects of SNPs.
● Synonymous SNPs are usually in the same quadrant.
● Points within the same quadrant have a small difference; opposite quadrants give larger ones.
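The quadrant behavior can be seen with a plain Euclidean distance in the X-Y plane. The slides do not give the actual metric, so `point_distance` and `curve_distance` below are assumptions: a naive position-by-position sum with no alignment step, shown only to make the geometry concrete.

```python
import math


def point_distance(p, q):
    """Euclidean distance in the X-Y plane between two curve points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def curve_distance(curve_a, curve_b):
    """Sum of X-Y distances over the positions both curves share."""
    return sum(point_distance(p, q) for p, q in zip(curve_a, curve_b))


if __name__ == "__main__":
    a = (0.5, 0.5, 1)       # upper-right quadrant
    b = (0.75, 0.75, 1)     # same quadrant as a
    c = (-0.5, -0.5, 1)     # opposite quadrant
    print("same quadrant:    ", point_distance(a, b))
    print("opposite quadrant:", point_distance(a, c))
```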
Comparison Produces “Chunks”
● Comparison yields a list of chunks.
● Curves are aligned within the chunk.
● Summing chunks gives a single value for two curves.
● Analyzing them in detail allows mining local similarities and variations.
● Grouping allows examination of crossover recombination events.
Clustering: Traveling Salesman Problem
● The TSP is simple to describe, hard to solve:
  ● Start and finish in the same city.
  ● Visit a list of cities once each.
  ● Minimize the distance (cost).
● Optimal solutions will cluster the nearby cities.
● The problem was always in defining the clusters.
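To make the problem statement concrete, here is a brute-force solver for a handful of cities. It is only a sketch: exhaustive search is not what anyone would use on real data, and the helper names `tour_length` and `brute_force_tsp` are invented for this example.

```python
import math
from itertools import permutations


def tour_length(tour, dist):
    """Cost of visiting `tour` in order and returning to the first city."""
    n = len(tour)
    return sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))


def brute_force_tsp(dist):
    """Try every closed tour starting at city 0; practical only for ~10 cities."""
    n = len(dist)
    candidates = ((0,) + perm for perm in permutations(range(1, n)))
    best = min(candidates, key=lambda t: tour_length(t, dist))
    return list(best), tour_length(best, dist)


if __name__ == "__main__":
    # Two tight pairs of cities, ten units apart.
    points = [(0, 0), (0, 1), (10, 0), (10, 1)]
    dist = [[math.dist(p, q) for q in points] for p in points]
    tour, length = brute_force_tsp(dist)
    print(tour, length)
```

In the shortest tour each tight pair is visited back to back, which is the sense in which an optimal tour places nearby cities next to each other.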
Take a Walk and Cluster Your Genes
● Climer & Zhang, 2004.
● Method for detecting N clusters:
  ● Add N dummy cities to the distance map.
  ● Each one has the same, small distance to all other cities (we use 220).
  ● Dummy cities end up in the inter-cluster gaps.
● The process is trivial to implement: just add that many rows and columns to the original comparison matrix.
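Adding the rows and columns might look like the sketch below. The function names, the choice to give dummy cities the same distance to each other as to real cities, and the tour-splitting helper are assumptions for illustration; the dummy distance is left as a parameter and no specific value is implied.

```python
def add_dummy_cities(dist, n_dummies, dummy_distance):
    """Return a new square matrix with n_dummies extra rows and columns.

    Every entry involving a dummy city is dummy_distance; the diagonal is 0.
    """
    n = len(dist)
    size = n + n_dummies
    out = [[dummy_distance] * size for _ in range(size)]
    for i in range(n):
        for j in range(n):
            out[i][j] = dist[i][j]
    for i in range(size):
        out[i][i] = 0.0
    return out


def split_tour(tour, n_real):
    """Cut a closed tour at the dummy cities (index >= n_real) into clusters."""
    start = next((k for k, city in enumerate(tour) if city >= n_real), None)
    if start is None:
        return [list(tour)]
    # Rotate so the tour begins at a dummy; each run of real cities is a cluster.
    rotated = list(tour[start:]) + list(tour[:start])
    clusters, current = [], []
    for city in rotated:
        if city >= n_real:
            if current:
                clusters.append(current)
            current = []
        else:
            current.append(city)
    if current:
        clusters.append(current)
    return clusters


if __name__ == "__main__":
    matrix = add_dummy_cities([[0, 5], [5, 0]], n_dummies=1, dummy_distance=0.5)
    print(matrix)
    print(split_tour([0, 1, 4, 2, 3, 5], n_real=4))
```

Once a TSP solver has produced a tour over the enlarged matrix, reading off the runs of real cities between dummies gives the N clusters.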
Displaying the Tour
● Mapping the tour onto a circle gives a good view of the distances.
● Coloring simplifies inspection.
  ● Black dots for dummy cities.
  ● Single type at the top (e.g. wild type).
  ● Color successive data points using the “rainbow” sequence with a large number of colors.
  ● Sequences more alike get more similar colors.
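A possible layout routine is sketched below using only the Python standard library. It is a simplification under stated assumptions: stops are spaced evenly around the circle instead of in proportion to tour distance, the hue range of 0.8 is an arbitrary choice to keep the first and last colors distinct, and `layout_tour` is a hypothetical name. The authors use R for display; this is not their code.

```python
import colorsys
import math


def layout_tour(tour, n_real):
    """Return (city, (x, y), (r, g, b)) for each stop on the tour.

    The first stop is placed at the top of a unit circle and the rest follow
    clockwise. Dummy cities (index >= n_real) are black; real cities take
    successive hues so neighbors on the tour get similar colors.
    """
    n = len(tour)
    placed = []
    for k, city in enumerate(tour):
        angle = math.pi / 2 - 2 * math.pi * k / n
        xy = (math.cos(angle), math.sin(angle))
        if city >= n_real:
            rgb = (0.0, 0.0, 0.0)
        else:
            rgb = colorsys.hsv_to_rgb(0.8 * k / n, 1.0, 1.0)
        placed.append((city, xy, rgb))
    return placed


if __name__ == "__main__":
    for city, xy, rgb in layout_tour([0, 1, 2, 3], n_real=3):
        print(city, xy, rgb)
```

Starting the tour list with the reference sequence (for example the wild type) puts it at the top of the circle.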
Example with 8 DR, 100 Samples
Multiple uses for color sequence.
● Track individual over time.
  ● Progression through colors shows history.
  ● Clustering highlights progression towards drug resistance.
● Track sample population.
  ● Recycling the colors from one initial tour helps show changes in successive graphs.
  ● Simplifies tracking progression in anonymous populations found in HIV treatment centers.
Visualizing W-curves
● We use a WebGL-based package, “WebCurve”.
● Developed at IIT as a web-friendly solution for examining 3D geometry.
● Gracefully handles displaying 100+ sequences at 10K bases each on a notebook computer.
● Available from GitHub; the archive includes a web server and code to generate files for display.
Summary
● W-curve and TSP allow us to cluster genes.
● Provides a more useful output in a clinical setting.
● Color coding the TSP results allows tracking changes in a population or the progression of an individual over time.