Aligning sequences with W-curve and SQL

Standard tools analyze strings

● Blast, Fasta, and ClustalW combine strings with probability tables to compare sequences.
– Assume that all differences are significant.

● This works well enough in most cases: Fly, chimp, and human genomes are easily comparable.

● Clustal Omega adds HMMs to the mix.
– Does a better job keeping sequences compact.

● Catch: They don't handle messy DNA.

“Messy” sequences

● Several classes of important problems break the string-based approaches:
– Sequences which break probability tables.
– Large indels.
– Crossovers, inversions, and CNV's.
– Discontinuous epitopes.
– Microbiome studies with many source mappings.

● These break basic assumptions required for character-string comparisons.

Example: Cancer and CNV's

● At a smaller scale, CNV's look like crossovers.
– Shotgun sequencing gets partial matches on multiple genes at the CNV boundary.

● Cancer is often messy.
– Nature, radiation, or chemical insults leave the sequences scrambled.
– Large indels or crossovers between chromosomes are common.

Dealing with HIV-1

● Non-correcting RNA virus: uniform rate of variation.
– High rate of SNP's.
– Also high rate of gaps.
– Strains run from 8000+ to 10 000+ bases.

● Dual-strand package in virion:
– Multi-strain crossovers are common.

Example: gp120 & CD4 Site

● Try this: Align HXB2's gp120 sequence against HXB2's own genome using ClustalW2.
– gp120 is ~1500 bases long.
– Alignment using ClustalW2 stretches the gene past 5000 bases.

● Now try to analyze the CD4 binding site.
– 60+ bases spread across a 1500-base gene in 7 groups of 6-12 bases each.
– Clades based on the whole gene use largely white noise.

CD4 binding site in gp120

Clades Based on gp120

Combined work cycle

● Part of the problem is combining stages.
– Alignment and comparison happen in one step.

● Tools only compare full sequences rather than subsets.
– Messy sequences require multi-stage alignment.
– This doesn't work with the partial matches required for crossovers, CNV's, or really large indels.

Alignments with the W-curve

● The W-curve was originally developed for visual analysis of long sequences.

● Later adapted to computed alignments.

● The process converts DNA strings to 3D geometry.
– Local state gives more detail for analysis than character strings.
– Comparing geometry produces alignments.

Mapping DNA to Geometry

● Geometry is produced from DNA sequence.

● Successive points move halfway to the corner associated with a DNA base.

Example: “CG”

● First point is halfway from origin to (0,1).

● Second point is halfway to (1,0).
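The halfway rule can be sketched in a few lines of Python. The corners for C at (0,1) and G at (1,0) come from the “CG” example; the corners for A and T, and the function name `w_curve`, are assumptions made only for this illustration. The third coordinate is the base number.

```python
# Corner for each base. C and G follow the "CG" example in the slides;
# the A and T corners are ASSUMED here for illustration only.
CORNERS = {"C": (0.0, 1.0), "G": (1.0, 0.0), "A": (0.0, -1.0), "T": (-1.0, 0.0)}

def w_curve(seq):
    """Return one (x, y, base_no) vertex per base, starting from the origin."""
    x, y = 0.0, 0.0
    points = []
    for base_no, base in enumerate(seq.upper(), start=1):
        cx, cy = CORNERS[base]
        # Move halfway from the current point to the corner for this base.
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y, base_no))
    return points

print(w_curve("CG"))
```

For “CG” this gives (0, 0.5) at base 1 and (0.5, 0.25) at base 2, matching the two steps described above.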

Important Properties

● The W-curve balances variation and convergence.

● Curves vary locally due to differences in the sequences that produce them.

● A SNP changes the surrounding curve.

● After the sequences converge, so do the curves.

● This leaves a few bases of differences after a SNP or gap.

Convergence after a SNP

● Blue & Green curves converge within a few bases of the SNP at base 3.
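A small numeric sketch of why the curves converge, assuming the halfway rule from the earlier slides: once two sequences agree again, each new base halves the remaining distance between their curves. The A and T corners, the two test sequences, and the helper names are assumptions for illustration.

```python
import math

# C and G corners follow the "CG" example; A and T corners are ASSUMED.
CORNERS = {"C": (0.0, 1.0), "G": (1.0, 0.0), "A": (0.0, -1.0), "T": (-1.0, 0.0)}

def w_curve_xy(seq):
    """XY position of the curve after each base, starting from the origin."""
    x, y = 0.0, 0.0
    points = []
    for base in seq.upper():
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points

def curve_distances(seq_a, seq_b):
    """Distance between the two curves at each base position."""
    return [math.hypot(ax - bx, ay - by)
            for (ax, ay), (bx, by) in zip(w_curve_xy(seq_a), w_curve_xy(seq_b))]

wild = "ACGTACGT"
snp = "ACTTACGT"   # hypothetical sequence differing only at base 3 (G -> T)
dist = curve_distances(wild, snp)
print(dist)
```

The distance is zero before the SNP, jumps at base 3, then halves with every matching base, so the curves fall back inside any fixed matching radius after a handful of bases.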

Global view: Wild, D-R HIV-1 POL

Local view: Wild, D-R HIV-1 POL

SQL & Alignments

● This approach is for alignment only.
– Subsequences can be aligned in a second pass.
– Produces clades based on clinically significant subsets of the sequence.

● While this can be done manually with Clustal, the W-curve can automate the cycle.
– Final alignments can be further processed for scoring by any available method.

Storing Geometry

● The internals of most “geocoding” rely on geometric extensions to SQL.

● These extensions are commonly called “GIS”; “PostGIS” implements them for the Postgres database.

● This allows storing W-curves in a database and querying them.
– Global convergence allows for comparison.
– Fuzzy matching deals with variation.

W-curve & GIS

● Due to SNP's and gaps, we need some way to make a fuzzy comparison of the curves.

● Our current approach uses simple geometry to analyze the distances.
– Template and sample vertexes are queried by geometry.
– Ultimately, the approach permits multiple ways to store and query the curves.

Fixed templates, fuzzy samples

● We have many, often long templates.

● There is one sample, which is often short.

● Solution: Attach “fuzziness” to the sample:
– Store template vertexes as points.
– Store samples as circles or polygons.
– Ask which template points are within the sample circles.
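The “points within circles” question can be sketched without a database. This is plain Python standing in for the GIS lookup; the coordinates, the radius, and the helper name `within` are hypothetical.

```python
import math

def within(circle, point):
    """True if the template point lies inside the sample circle (XY plane only)."""
    cx, cy, radius = circle
    px, py = point
    return math.hypot(px - cx, py - cy) <= radius

# Hypothetical data: template vertexes are points, sample vertexes are circles.
template_points = [(0.50, 0.25), (0.10, -0.40), (0.52, 0.24)]
sample_circles = [(0.51, 0.25, 0.05)]   # (x, y, radius)

hits = [p for c in sample_circles for p in template_points if within(c, p)]
print(hits)
```

Putting the radius on the single short sample, rather than on every template vertex, keeps the many long templates stored as plain points.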

Fuzzy Matching

Query output: ID, base, offset

● Select uses 'within' lookup on XY plane.
– Extracts template sequence ID and Z-axis (base no).
– Computes the difference of sample and template base numbers (“offset”).

● Results are sorted by ID, offset, and template base.
– Sequences of adjacent template base numbers with the same offset are alignments.
– No need for the sample bases since they can be recomputed from template base + offset.

SQL for extraction is readable

● Only three values are required from the initial query.

● From two tables.

● Where the fuzzy vertex contains the fixed one.

select b.seq_id, b.base_no, a.base_no - b.base_no as delta
from fuzzy a, fixed b
where st_contains ( a.vertex, b.vertex )
order by b.seq_id, delta, b.base_no;
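The same select can be tried without PostGIS. The sketch below uses Python's built-in sqlite3 module and replaces st_contains with a plain circle test; the x, y, and radius columns and all of the data are assumptions made for this illustration, not the schema used with PostGIS.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    create table fixed (seq_id integer, base_no integer, x real, y real);
    create table fuzzy (base_no integer, x real, y real, radius real);
""")
# Hypothetical vertexes: sample bases 1..3 lie near template 7, bases 11..13.
db.executemany("insert into fixed values (?, ?, ?, ?)", [
    (7, 11, 0.50, 0.25), (7, 12, -0.25, 0.125), (7, 13, 0.375, 0.0625),
    (7, 40, 0.90, 0.90),
])
db.executemany("insert into fuzzy values (?, ?, ?, ?)", [
    (1, 0.51, 0.25, 0.05), (2, -0.24, 0.12, 0.05), (3, 0.37, 0.06, 0.05),
])
rows = db.execute("""
    select b.seq_id, b.base_no, a.base_no - b.base_no as delta
    from fuzzy a, fixed b
    where (a.x - b.x) * (a.x - b.x) + (a.y - b.y) * (a.y - b.y)
          <= a.radius * a.radius          -- stand-in for st_contains
    order by b.seq_id, delta, b.base_no
""").fetchall()
print(rows)
```

Three adjacent template bases come back with the same delta of -10, which is the pattern the next step collapses into a chunk.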

Query result: “chunks”

● Runs of template base numbers are collapsed into “chunks” of alignment.
– [ SeqID, Start, Stop, Offset ] describe the alignment.
– Process makes allowances for small (<7 b.p.) gaps.

● Amenable to parallel processing:
– Database nodes perform “within” query.
– Summarized into chunks.
– Gathered for final processing.
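One way the collapse into chunks might look, assuming rows arrive sorted by sequence ID, offset, and template base as the query orders them. The function name `collapse` and the `max_gap` parameter are hypothetical; reading “<7 b.p.” as “fewer than 7 missing bases” is also an assumption.

```python
def collapse(rows, max_gap=7):
    """Collapse sorted (seq_id, base_no, offset) rows into
    [seq_id, start, stop, offset] chunks, bridging small gaps."""
    chunks = []
    for seq_id, base_no, offset in rows:
        last = chunks[-1] if chunks else None
        if last is not None and last[0] == seq_id and last[3] == offset:
            missing = base_no - last[2] - 1    # bases skipped since the last row
            if 0 <= missing < max_gap:
                last[2] = base_no              # extend the current chunk
                continue
        chunks.append([seq_id, base_no, base_no, offset])
    return chunks

# A few illustrative rows; base 6309 is missing from sequence 2.
rows = [(1, 6880, 93), (1, 6881, 93), (1, 6882, 93),
        (2, 6307, -82), (2, 6308, -82), (2, 6310, -82), (2, 6311, -82)]
print(collapse(rows))
```

The single missing base in sequence 2 is bridged, so each sequence yields one chunk. The only operations are integer comparisons and a subtraction.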

Post-processing chunks

● Sort by:
– start base + offset ascending.
– stop base + offset descending.

● Chunks are ordered along the sample, longest to shortest.

● Now the chunks can be filtered depending on their source and expected use.
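The two-key sort is a one-liner in Python. Chunks are written as (seq_id, start, stop, offset) tuples; the third chunk below is hypothetical and is there only to show the tie-break.

```python
def sort_chunks(chunks):
    """Order chunks along the sample: sample start ascending, then sample stop
    descending so the longest chunk at a given start comes first."""
    return sorted(chunks, key=lambda c: (c[1] + c[3], -(c[2] + c[3])))

chunks = [
    (1, 6880, 7665, 93),
    (2, 6307, 7054, -82),
    (3, 6307, 6500, -82),   # hypothetical shorter chunk at the same sample start
]
ordered = sort_chunks(chunks)
print(ordered)
```

Chunks 2 and 3 both start at sample base 6225, and the longer chunk 2 sorts ahead of chunk 3; chunk 1 follows at sample base 6973.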

Filtering the Chunks

● Short reads, crossovers, CNV's, tag searches will all present different patterns in the chunks.
– Short reads look for start-to-end matches.
– Crossovers are tiled end-to-end matches from different sequences.
– CNV's have two sequences with a break in the offsets equal to the CNV size.
– Tags show up as multiple matches of the same length from multiple templates.
– Inversions match a reversed copy of the sample.

Crossovers: Adjacent Chunks

● Crossovers will have two chunks from different sequences.

● Whatever their offsets, the two chunks add up to the full sample's length with minimal gap or overlap.

Example: crossover gp120

● Sequences with adjacent bases and the same offset.

...
1 6880 +93
1 6881 +93
1 6882 +93
...
1 7665 +93
...
2 6307 -82
2 6308 -82
2 6310 -82
2 6311 -82
...
2 7054 -82



● Get collapsed into chunks.

[ 1 6880 7665 +93 ]

[ 2 6307 7054 -82 ]




● Sort by sample bases from start + offset and stop + offset.

[ 2 6225 6972 -82 ]
[ 1 6973 7758 +93 ]

● Result is a tiling of gp120: 6225 – 7758.
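A crossover filter over the chunks can be sketched as a few integer comparisons. The function name `is_crossover` and the `slop` tolerance are hypothetical; chunks are (seq_id, start, stop, offset) tuples in template coordinates, as they come out of the collapse step.

```python
def is_crossover(chunks, sample_start, sample_stop, slop=7):
    """True if exactly two chunks from different sequences tile the sample."""
    if len(chunks) != 2:
        return False
    first, second = sorted(chunks, key=lambda c: c[1] + c[3])
    if first[0] == second[0]:
        return False                       # both chunks come from one sequence
    first_start, first_stop = first[1] + first[3], first[2] + first[3]
    second_start, second_stop = second[1] + second[3], second[2] + second[3]
    return (abs(first_start - sample_start) <= slop          # covers the start
            and abs(second_stop - sample_stop) <= slop       # covers the end
            and abs(second_start - first_stop - 1) <= slop)  # minimal gap/overlap

# The two chunks from the gp120 example, with gp120 at sample bases 6225..7758.
chunks = [(1, 6880, 7665, 93), (2, 6307, 7054, -82)]
print(is_crossover(chunks, 6225, 7758))
```

With the example chunks, sequence 2 covers 6225 to 6972 and sequence 1 picks up at 6973, so the check succeeds with no gap at the seam.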

Working with short reads

● The previous example shows the bases on a known sequence, gp120.

● For short reads the sample base numbering will be 1..N.
– Same basic results, with chunk start and stop from 1 .. N and large offsets.
– Known sample bases simplify searches for tags or relative positions.

CNV's have a similar pattern

● Three chunks with ABA sequence.

● The offset of the second A block shifts by the size of the chunk between them.
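One reading of the ABA pattern, sketched as a filter: the outer chunks come from the same sequence, the middle chunk from another, and the change in offset between the two A chunks matches the length of the middle chunk. The function name, the tolerance, and the sample data are hypothetical.

```python
def looks_like_cnv(chunks, tolerance=7):
    """True if three chunks form an A-B-A pattern along the sample and the
    offset shift between the A chunks matches the length of the B chunk."""
    if len(chunks) != 3:
        return False
    a1, b, a2 = sorted(chunks, key=lambda c: c[1] + c[3])   # order along sample
    if a1[0] != a2[0] or a1[0] == b[0]:
        return False
    middle_len = b[2] - b[1] + 1
    offset_shift = abs(a2[3] - a1[3])
    return abs(offset_shift - middle_len) <= tolerance

# Hypothetical chunks (seq_id, start, stop, offset) for a 300-base sample:
# sequence 5 covers sample 1..100 and 201..300, sequence 9 covers 101..200.
chunks = [(5, 1001, 1100, -1000), (9, 501, 600, -400), (5, 1101, 1200, -900)]
print(looks_like_cnv(chunks))
```

Here the offset moves from -1000 to -900, a shift of 100 bases, which equals the 100-base chunk sitting between the two A blocks.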

Other filters

● Inversions can be detected by reversing the sample sequence.

● Copies of genes will show up as multiple sequences with full coverage.

Occam's Razor

● All of this is done with simple arithmetic:
– Numeric sort.
– Comparing integers to 1 or 7.
– Subtracting base numbers.
– Adding offsets.

● This keeps the algorithm easy to program and validate.

Aligning sub-sequences

● Aligning subsequences simply reduces the chunk sizes.
– Start with chunks that overlap the subsequence.
– Generate new chunks with more restrictive start and stop values.
– The sequence and offset do not change.

● Again, nothing more than simple arithmetic is required for the sub-alignments.
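Restricting chunks to a subsequence needs only max and min. This sketch assumes the subsequence is given in template base numbers; the function name `clip_chunks` and the 7000..7100 window are hypothetical.

```python
def clip_chunks(chunks, sub_start, sub_stop):
    """Clip (seq_id, start, stop, offset) chunks to a template base range.
    Sequence ID and offset are carried over unchanged."""
    clipped = []
    for seq_id, start, stop, offset in chunks:
        new_start = max(start, sub_start)
        new_stop = min(stop, sub_stop)
        if new_start <= new_stop:          # keep only chunks that overlap
            clipped.append((seq_id, new_start, new_stop, offset))
    return clipped

chunks = [(1, 6880, 7665, 93), (2, 6307, 7054, -82)]
print(clip_chunks(chunks, 7000, 7100))
```

If the subsequence is known in sample coordinates instead, the same clipping applies after adding each chunk's offset to its start and stop.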

Samples start extra-fuzzy

● The leading few bases of a sample are generated from (0,0,0).

● Template sequences begin at a point within the curve.

● This can be easily handled by using a slightly larger radius for the first few sample bases.
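One possible way to size the larger radius, assuming the halfway rule: whatever error comes from starting at the origin instead of the template's true position is halved by every base, so the extra radius can shrink by half per base. The function name and both default values are hypothetical.

```python
def sample_radius(base_no, base_radius=0.05, start_error=1.0):
    """Matching radius for a sample vertex at 1-based position base_no.

    start_error is an ASSUMED upper bound on the distance between the origin
    and the point where the template curve actually is when the sample begins.
    Each base halves that error, so the extra allowance decays as 2**-base_no.
    """
    return base_radius + start_error * 0.5 ** base_no

for base_no in (1, 2, 5, 10):
    print(base_no, sample_radius(base_no))
```

After roughly ten bases the extra allowance is below 0.001, so the radius is effectively back to its normal value.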

Indeterminate bases

● FASTQ data includes indeterminate bases and quality scores.
– We can generate multiple points for a base.
– These can be stored as “multipoint” in GIS.
– Compared using a convex hull or nearest neighbor to the template.
– Explicit use of fuzzy math combined with quality values?
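Generating multiple points for an indeterminate base might look like the sketch below, which expands standard IUPAC ambiguity codes into one candidate point per possible base. The A and T corners and the function name are assumptions; how the curve should continue after an ambiguous base is left open here.

```python
# C and G corners follow the "CG" example; A and T corners are ASSUMED.
CORNERS = {"C": (0.0, 1.0), "G": (1.0, 0.0), "A": (0.0, -1.0), "T": (-1.0, 0.0)}

# IUPAC nucleotide codes mapped to the bases each one may stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def candidate_points(prev_xy, code):
    """One halfway point per base the code could represent."""
    px, py = prev_xy
    return [((px + CORNERS[b][0]) / 2, (py + CORNERS[b][1]) / 2)
            for b in IUPAC[code.upper()]]

print(candidate_points((0.0, 0.0), "S"))   # S means C or G
print(len(candidate_points((0.0, 0.0), "N")))
```

The resulting list is the kind of point set that could be stored as a multipoint and compared against the template by hull or nearest neighbor.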

Parallel processing

● Recursive algorithms in Blast, Fasta, Clustal are not amenable to piecewise or parallel processing.

● The query used here is trivial to distribute across multiple servers.

● Compressing rows into chunks is also suitable for multi-node processing.

● The cycle is suitable for parallel or cloud computing environments.
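A toy version of the scatter and gather cycle, with Python threads standing in for database nodes. Each worker runs the within test against its own shard of template vertexes and the results are merged and sorted at the end. The shard layout, function names, and data are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def match_shard(shard, sample):
    """Within test for one shard of template vertexes (seq_id, base_no, x, y)
    against sample circles (base_no, x, y, radius)."""
    rows = []
    for seq_id, base_no, x, y in shard:
        for s_base, sx, sy, radius in sample:
            if (x - sx) ** 2 + (y - sy) ** 2 <= radius ** 2:
                rows.append((seq_id, base_no, s_base - base_no))
    return rows

def parallel_match(shards, sample, workers=4):
    """Scatter the shards to workers, gather rows, sort by ID, offset, base."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda shard: match_shard(shard, sample), shards)
        rows = [row for part in parts for row in part]
    return sorted(rows, key=lambda r: (r[0], r[2], r[1]))

shards = [
    [(1, 11, 0.5, 0.25)],
    [(2, 30, 0.5, 0.25), (2, 31, -0.25, 0.125)],
]
sample = [(1, 0.5, 0.25, 0.05), (2, -0.25, 0.125, 0.05)]
result = parallel_match(shards, sample)
print(result)
```

No shard needs to see another shard's rows, which is what makes the query step easy to spread across nodes; only the final sort and filtering run in one place.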

Summary

● Using a standalone alignment step is more flexible for messy DNA.

● The W-curve adds enough state to the DNA sequence that we can query nearby vertexes.

● A simple query plus filtering lets us search sequences for alignments, including post-processing for subsequence alignment.

● This is a tool for automating alignment of messy sequences.
