Devin Petersohn Poster

To convert a DNA sequence into a grayscale image, we first convert each character into a unique value: A=0, C=1, G=2, T=3. Then, to convert those values into a 4-bit grayscale value (gray color values from 0-15), we use the following formula:

(P1*4)+(P2)

where P1 is the character in the first position and P2 is the character in the second. The resulting grayscale values form the pixels of an image that represents the original sequence. To get a 10x10 image, a sequence of 101 base pairs is required, because the sliding window yields one pixel per pair of adjacent characters (100 overlapping pairs).

Example: CATGCTAACTGATCACTATAGCGCGCTATCATACGCGATCTACGCT
P1 = C = 1, P2 = A = 0 => (1*4) + (0) = 4

Using a sliding window, the second position becomes the first.
CATGCTAACTGATCACTATAGCGCGCTATCATACGCGATCTACGCT
P1 = A = 0, P2 = T = 3 => (0*4) + (3) = 3

Each two-character sequence receives a unique value from 0-15, which corresponds to its grayscale value in the 10x10 image.
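The conversion described above can be sketched in Python as follows. This is a minimal illustration, not the poster's implementation: the function name `sequence_to_image`, the row-major pixel layout, and the repetition of the example sequence to reach 101 bases are all assumptions made for the sake of a runnable example.

```python
# Base-to-value mapping from the poster: A=0, C=1, G=2, T=3.
BASE_VALUES = {"A": 0, "C": 1, "G": 2, "T": 3}


def sequence_to_image(seq, side=10):
    """Convert a DNA sequence into a side x side grid of 4-bit gray values."""
    needed = side * side + 1  # a window of 2 sliding over N bases gives N-1 pixels
    if len(seq) < needed:
        raise ValueError(f"need at least {needed} bases, got {len(seq)}")
    values = [BASE_VALUES[base] for base in seq[:needed]]
    # Each pixel is (P1*4)+(P2) for one pair of adjacent bases.
    pixels = [p1 * 4 + p2 for p1, p2 in zip(values, values[1:])]
    # Row-major layout is an assumption; the poster does not state the pixel order.
    return [pixels[r * side:(r + 1) * side] for r in range(side)]


# The poster's example sequence has 46 bases, so it is repeated here
# (for illustration only) to reach the 101 bases a 10x10 image needs.
example = "CATGCTAACTGATCACTATAGCGCGCTATCATACGCGATCTACGCT"
image = sequence_to_image((example * 3)[:101])
```

The first two pixels of `image` reproduce the worked example: CA gives 4 and AT gives 3.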

Abstract

Co-Occurrence Matrix and Texture Measurement

Locating Potential DNA Mutations

Discussion

Identifying the Long Ultra Similar Elements (LUSEs) in genomes can yield a myriad of new information regarding the result of a genetically and evolutionarily significant mutation. However, current methods of identifying LUSEs cannot capture every possible mutation (insertion, deletion, and base pair substitution) without an exhaustive pair-wise comparison using the Levenshtein Similarity measurement. Alignment algorithms attempt to solve this problem, but can only calculate the maximum consecutively similar elements in a string of base pairs. We have developed an image-based method of identifying LUSEs in genomes that has a strong correlation to the Levenshtein Similarity measurement. Our approach first converts a sequence into a 10x10 grayscale image. Then, using existing co-occurrence matrix based texture feature metrics, we generate a unique feature vector for each sequence by which other sequences can be compared. These feature vectors can then be plotted and, using a clustering algorithm, we will be able to identify clusters of sequences that share a Levenshtein Similarity greater than 90% (or another threshold of our choosing). Because of the correlation between clusters and the Levenshtein Similarity measurement, we can avoid pair-wise comparisons altogether. Because there are no pair-wise comparisons, these algorithms can run in parallel using a MapReduce function in a Big Data Ecosystem (Hadoop), offering a suitable solution to this Big Data problem that is scalable to the amount of hardware available. The final product will be a hash function that can return all clustered LUSEs very quickly for biology researchers to access in real time.

The final product is a searchable database for evolutionary biologists to be able to upload and compare organism genomes against all other genomes already in the database.

The Levenshtein Similarity measurement calculates similarity between strings based on the minimum number of deletions, insertions, and substitutions it takes to get from one string to another [7].
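As a rough illustration of this measurement, the sketch below computes the edit distance with the standard dynamic-programming recurrence. The normalization used in `levenshtein_similarity` (one minus the distance divided by the longer length) is an assumption; the poster does not state how the distance is turned into a percentage.

```python
def levenshtein_distance(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a, b):
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


similarity = levenshtein_similarity("ACGT", "ACGA")  # one substitution in four bases
```

Under this assumed normalization, two 101-base sequences would need at most 10 edits between them to exceed a 90% similarity threshold.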

Retrieved from: http://images.flatworldknowledge.com/ballgob/ballgob-fig19_015.jpg

Purpose of this approach:

Work in Big Data Ecosystem
Algorithm can run in parallel
Scalable performance to amount of hardware available
No pairwise comparison

Correlation between Levenshtein Distance and Texture Feature Measurement Methods
(correlation with Levenshtein Similarity; 1 is perfectly correlated)

Texture Feature Measurement Method(s)      Correlation
Contrast                                   0.8738
Homogeneity                                0.4313
Entropy                                    0.7540
Dissimilarity                              0.8691
Contrast & Homogen.                        0.8270
Homogen. & Entropy                         0.7884
Entropy & Dissim.                          0.8861
Contrast & Entropy                         0.8697
Contrast & Dissim.                         0.8737
Homogen. & Dissim.                         0.8198
Contrast, Homogen., & Entropy              0.8648
Contrast, Homogen., & Dissim.              0.8507
Contrast, Entropy, & Dissim.               0.8986
Homogen., Entropy, & Dissim.               0.8750
Contrast, Homogen., Entropy, & Dissim.     0.8880

The Co-Occurrence Matrix is created by counting the number of grayscale pixel values that occur near another in a given image [4]. From the Co-Occurrence Matrix, we can generate features with existing methods [4].
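A minimal sketch of building such a matrix is shown below. It assumes that "near" means the eight immediate neighbors of each pixel and that the image uses 16 gray levels (0-15); the function name `cooccurrence_matrix` is hypothetical, and the neighborhood actually used for the poster may differ.

```python
def cooccurrence_matrix(image, levels=16):
    """Count how often gray value i has gray value j as an immediate neighbor."""
    rows, cols = len(image), len(image[0])
    matrix = [[0] * levels for _ in range(levels)]
    # Assumed neighborhood: the 8 pixels surrounding each pixel.
    offsets = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
               if (dr, dc) != (0, 0)]
    for r in range(rows):
        for c in range(cols):
            for dr, dc in offsets:
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:  # skip neighbors off the image
                    matrix[image[r][c]][image[nr][nc]] += 1
    return matrix


# Tiny 2x2 image with two gray levels, for illustration only.
tiny = [[0, 1],
        [1, 0]]
counts = cooccurrence_matrix(tiny, levels=2)
```

Because every neighbor pair is counted from both sides, the resulting matrix is symmetric.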

Contrast, Dissimilarity, Homogeneity, Entropy

These feature measurement metrics are used to reduce the co-occurrence matrix down to values that can be measured or plotted against other images [4]. The graph details the correlation between Levenshtein Similarity and all possible combinations of the above feature metrics. The most correlated combination of metrics is Contrast, Entropy, and Dissimilarity, with a strong 0.8986 correlation (1 is perfectly correlated).
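The four metrics can be sketched as below, using the commonly cited co-occurrence texture definitions: contrast sums P(i,j)(i-j)^2, dissimilarity sums P(i,j)|i-j|, homogeneity sums P(i,j)/(1+(i-j)^2), and entropy is the negative sum of P(i,j) ln P(i,j), where P is the matrix normalized to sum to 1. Treating these as the exact formulas used on the poster is an assumption, and the function name `texture_features` is hypothetical.

```python
import math


def texture_features(matrix):
    """Reduce a co-occurrence count matrix to four scalar texture features."""
    total = sum(sum(row) for row in matrix)
    if total == 0:
        raise ValueError("co-occurrence matrix is empty")
    features = {"contrast": 0.0, "dissimilarity": 0.0,
                "homogeneity": 0.0, "entropy": 0.0}
    for i, row in enumerate(matrix):
        for j, count in enumerate(row):
            if count == 0:
                continue  # zero cells contribute nothing (and log(0) is undefined)
            p = count / total  # normalize counts to probabilities
            diff = i - j
            features["contrast"] += p * diff * diff
            features["dissimilarity"] += p * abs(diff)
            features["homogeneity"] += p / (1 + diff * diff)
            features["entropy"] -= p * math.log(p)
    return features


# Hypothetical 2-level co-occurrence counts, for illustration only.
features = texture_features([[2, 4], [4, 2]])
feature_vector = (features["contrast"], features["entropy"],
                  features["dissimilarity"])
```

The tuple at the end follows the poster's best-performing combination of Contrast, Entropy, and Dissimilarity.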

[Figure: an image with a sliding window shown at Window Position 1 and Window Position 2; the marked pixel is compared against all of its neighbors.]

Finding LUSE Overview
[Flow diagram: User (Start/End), Query (Sequence), Feature Metric Calculation, Cluster with Similar Sequences]
User submits query sequence of at least 101 base pairs
Feature Metrics are generated from query
Metrics are plotted and clustered

These metrics can next be plotted in 3-dimensional space and clustered using the K Means algorithm. Because of the strong correlation, each cluster will represent sequences that meet a measurable similarity threshold.
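A minimal K Means sketch over 3-dimensional feature vectors is given below. It is illustrative only: the farthest-point seeding is a choice made here for reproducibility, the sample points are invented, and the poster does not describe how its clustering was initialized or how K was chosen.

```python
import math


def k_means(points, k, iterations=100):
    """Cluster points (tuples of floats) with Lloyd's algorithm."""
    # Farthest-point seeding: deterministic, an assumption made for this sketch.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(
            points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [min(range(k), key=lambda i: math.dist(p, centroids[i]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = tuple(sum(axis) / len(members)
                                     for axis in zip(*members))
    return labels, centroids


# Invented (contrast, entropy, dissimilarity) vectors forming two groups.
vectors = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.1), (0.0, 0.1, 0.0),
           (5.0, 5.0, 5.0), (5.1, 5.0, 4.9), (4.9, 5.1, 5.0)]
labels, centroids = k_means(vectors, k=2)
```

In practice the features would likely be normalized first so that no single metric dominates the distance, which is the role of the normalization step in the second Map function.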

[Plot axes: Contrast, Entropy]

Benefits to approach:
MapReduce works in parallel => very fast: Linear time vs. Exponential
Same time cost to compare 1 vs. 1 and 1 vs. all
Scalable to amount of hardware available: More nodes = Better Performance
Setup can handle entire genomes to be compared at once
Only need to run a sequence once; results will continue to be added as database grows

Potential Uses:
Identify Ultra Conserved Elements (UCEs) [1]
Identify evolutionarily significant mutations
Potential for medical uses: Disease diagnosis, Genetic Research, etc.
Others

What's Next:
Testing different clustering algorithms: Soft Clustering
Implement and test Spark
June Publication

Yellow area is calculated, blank pixels are not

[1] Reneker J, Lyons E, Conant GC, Pires JC, Freeling M, Shyu CR, Korkin D. Proc Natl Acad Sci U S A. 2012 May 8;109(19):E1183-91. doi: 10.1073/pnas.1121356109. Epub 2012 Apr 10.
[2] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," Commun. ACM, vol. 51, Jan. 2008, pp. 107-113.
[3] Hadoop, http://hadoop.apache.org/
[4] Co-Occurrence Matrix, http://www.fp.ucalgary.ca/mhallbey/texture_calculations.htm
[5] Apache Spark, http://spark.apache.org/
[6] Apache HBase, http://hbase.apache.org/
[7] Levenshtein, Vladimir I. (February 1966). "Binary codes capable of correcting deletions, insertions, and reversals". Soviet Physics Doklady 10 (8): 707–710.

MapReduce, Hadoop, Spark, & HBase – Big Data Ecosystem

Retrieved from: http://hadoop.apache.org

MapReduce Overview

Cluster Setup
10 Intel NUC computers
1 Master Node: 16GB RAM, Dual Core 2.0GHz CPU, 1TB Hard Disk Space, 480GB Solid State Drive
9 Compute Nodes: 8GB RAM, Dual Core 2.0GHz CPU, 1TB Hard Disk Space, 480GB Solid State Drive

Retrieved from: http://spark.apache.org

// Map Function 1: input <k,v>
// k is the offset of the current file block (in bytes); v is a sequence in chromosome C
v = P(v) // remove invalid characters
for i = 0 to m-n do {
    FV = generateFV(v[i to i+n]) // generate feature vector
    start_pos = i + k
    emit (FV, (start_pos, C)) // one pair per subsequence
}

// Reduce Function 1: input <k,v>
// k is the feature vector (FV); v is the starting position of the subsequence w.r.t. the chromosome sequence
pos = merge(v)
return (k, pos)

// Map Function 2: input <k,v>
// k is the feature vector; v is the list of positions matching the feature vector
k = normalize(k) // normalize data
return (k, v)

// Reduce Function 2: input <k,v>
// k is the normalized feature vector; v is the list of starting positions
cl = kmean(k) // cluster data using k means
return (cl, v)
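The first Map and Reduce functions can be simulated in memory with plain Python, as sketched below. This is not Hadoop code: the names `map_function_1`, `reduce_function_1`, and `toy_fv` are hypothetical, n is assumed to be the subsequence length, and `toy_fv` is a stand-in that counts bases rather than computing the texture feature vector.

```python
from collections import defaultdict

VALID_BASES = set("ACGT")


def map_function_1(offset, sequence, chromosome, n, generate_fv):
    """Yield (FV, (start_pos, chromosome)) for every length-n subsequence."""
    # P(v): remove invalid characters. Positions are then relative to the
    # cleaned sequence, as in the pseudocode.
    cleaned = "".join(base for base in sequence.upper() if base in VALID_BASES)
    for i in range(len(cleaned) - n + 1):
        yield generate_fv(cleaned[i:i + n]), (i + offset, chromosome)


def reduce_function_1(pairs):
    """Merge the positions of all subsequences that share a feature vector."""
    merged = defaultdict(list)
    for fv, position in pairs:
        merged[fv].append(position)
    return dict(merged)


def toy_fv(window):
    """Stand-in feature vector for illustration only: counts of A, C, G, T."""
    return tuple(window.count(base) for base in "ACGT")


pairs = map_function_1(0, "AACGNAAC", "chr1", 3, toy_fv)
table = reduce_function_1(pairs)
```

In a real job the grouping done by `reduce_function_1` would be handled by the framework's shuffle stage, which delivers all values for one key to the same reducer.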

[Figure: MapReduce data flow. Labels: Master Node; Original Data (Sequence); Mapper 1 ... Mapper n, each emitting <FV, (Ch ID, Pos)> (Co-occurrence Matrix FV calculated); Reducer 1 ... Reducer n, each producing <FV, (List of Pos IDs)> (Aggregate elements with matching FV); Output to HBase.]

Retrieved from: http://hbase.apache.org

Identifying Long Ultra Similar Elements (LUSEs) in Genomes Using Image Based Texture Co-Occurrence Matrix

Devin Petersohn1 and Chi-Ren Shyu (Mentor)1,2

1Department of Computer Science, College of Engineering, 2MU Informatics Institute, University of Missouri

References

HBase Table Schema

Feature Vector                     | Table of ordered pairs (Ch ID, Pos)                       | K Mean Cluster ID (Calculated 2nd Iteration)
<Contrast, Entropy, Dissimilarity> | (Ch ID 1, Pos 1), (Ch ID 2, Pos 2), ..., (Ch ID n, Pos n) | K Mean Cluster ID
. . . .                            | . . . .                                                   | . . . .

Shuffling

Acknowledgements

This project was sponsored by the MU College of Engineering Undergraduate Honors Research Program

Undergraduate Research Forum – Spring 2014

[Figure: Running Time for 1st MapReduce Function on a 6 Node Cluster. x-axis: Number of Base Pairs (in Millions), 0 to 2,250; y-axis: Time (minutes), 0 to 200.]
