To convert a DNA sequence into a grayscale image, we first convert each character into a unique value: A=0, C=1, G=2, T=3.

Then, to convert those values into a 4-bit grayscale value (gray color values from 0-15), we use the following formula:

(P1*4) + (P2)

where P1 is the character in the first position and P2 is the character in the second position. The resulting grayscale values form the pixels of an image that represents the original sequence. To get a 10x10 image, a sequence of 101 base pairs is required, because a sliding window over 101 characters yields 100 overlapping two-character pairs.

Example:
CATGCTAACTGATCACTATAGCGCGCTATCATACGCGATCTACGCT
P1 = C = 1, P2 = A = 0 => (1*4) + (0) = 4

Using a sliding window, the second position becomes the first:
CATGCTAACTGATCACTATAGCGCGCTATCATACGCGATCTACGCT
P1 = A = 0, P2 = T = 3 => (0*4) + (3) = 3

Each two-character sequence receives a unique value from 0-15, which corresponds to its grayscale value in the 10x10 image.
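The conversion described above can be sketched in a few lines of Python. This is a minimal illustration, not the poster's implementation: the function names are hypothetical, and filling the image row by row is an assumption, since the text does not say how the 100 values are laid out.

```python
# Base-to-value mapping taken from the text: A=0, C=1, G=2, T=3.
BASE_VALUE = {"A": 0, "C": 1, "G": 2, "T": 3}

def pair_values(sequence):
    """Slide a two-character window over the sequence and apply (P1*4)+P2."""
    codes = [BASE_VALUE[base] for base in sequence.upper()]
    return [4 * p1 + p2 for p1, p2 in zip(codes, codes[1:])]

def sequence_to_image(sequence, side=10):
    """Hypothetical helper: arrange the first side*side pair values row by row.

    Row-major layout is an assumption; the text does not specify the order.
    """
    values = pair_values(sequence)
    needed = side * side
    if len(values) < needed:
        raise ValueError(f"need at least {needed + 1} bases, got {len(sequence)}")
    return [values[r * side:(r + 1) * side] for r in range(side)]

example = "CATGCTAACTGATCACTATAGCGCGCTATCATACGCGATCTACGCT"
first_pairs = pair_values(example)[:2]  # CA -> 4, AT -> 3, as in the example above
```

A sequence of n bases produces n-1 pair values, which is why 101 bases are needed to fill 100 pixels.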
Abstract
Co-Occurrence Matrix and Texture Measurement Locating Potential DNA Mutations
Discussion
Identifying the Long Ultra Similar Elements (LUSEs) in genomes can yield a myriad of new information regarding the result of a genetically and evolutionarily significant mutation. However, current methods of identifying LUSEs cannot capture every possible mutation (insertion, deletion, and base pair substitution) without an exhaustive pairwise comparison using the Levenshtein Similarity measurement. Alignment algorithms attempt to solve this problem, but can only calculate the maximum consecutively similar elements in a string of base pairs. We have developed an image-based method of identifying LUSEs in genomes that has a strong correlation to the Levenshtein Similarity measurement. Our approach first converts a sequence into a 10x10 grayscale image. Then, using existing co-occurrence matrix based texture feature metrics, we generate a unique feature vector for each sequence by which other sequences can be compared. These feature vectors can then be plotted and, using a clustering algorithm, we will be able to identify clusters of sequences that share a Levenshtein Similarity greater than 90% (or another threshold of our choosing). Because of the correlation between clusters and the Levenshtein Similarity measurement, we can avoid pairwise comparisons altogether. Because there are no pairwise comparisons, these algorithms can run in parallel using a MapReduce function in a Big Data Ecosystem (Hadoop), offering a solution to this Big Data problem that scales with the amount of hardware available. The final product will be a hash function that can return all clustered LUSEs very quickly for biology researchers to access in real time.
The final product is a searchable database that lets evolutionary biologists upload organism genomes and compare them against all other genomes already in the database.
The Levenshtein Similarity measurement calculates similarity between strings based on the minimum number of deletions, insertions, and substitutions it takes to get from one string to another [7].
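For reference, the edit distance behind this measurement can be computed with the standard dynamic-programming recurrence. The sketch below is illustrative only. In particular, turning the distance into a similarity by dividing by the length of the longer string is an assumption, since the poster does not state which normalization it uses.

```python
def levenshtein_distance(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]

def levenshtein_similarity(a, b):
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest
```

Each call costs time proportional to the product of the two string lengths, which is what makes exhaustive pairwise comparison across a genome expensive.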
Retrieved from: http://images.flatworldknowledge.com/ballgob/ballgob-fig19_015.jpg
Purpose of this approach:
Work in Big Data Ecosystem
Algorithm can run in parallel
Scalable performance to amount of hardware available
No pairwise comparison
Correlation between Levenshtein Distance and Texture Feature Measurement Methods
(Correlation with Levenshtein Similarity; 1 is perfectly correlated)

Texture Feature Measurement Method(s)       Correlation
Contrast                                    0.8738
Homogeneity                                 0.4313
Entropy                                     0.7540
Dissimilarity                               0.8691
Contrast & Homogen.                         0.8270
Homogen. & Entropy                          0.7884
Entropy & Dissim.                           0.8861
Contrast & Entropy                          0.8697
Contrast & Dissim.                          0.8737
Homogen. & Dissim.                          0.8198
Contrast, Homogen., & Entropy               0.8648
Contrast, Homogen., & Dissim.               0.8507
Contrast, Entropy, & Dissim.                0.8986
Homogen., Entropy, & Dissim.                0.8750
Contrast, Homogen., Entropy, & Dissim.      0.8880
The Co-Occurrence Matrix is created by counting the number of grayscale pixel values that occur near another in a given image [4]. From the Co-Occurrence Matrix, we can generate features with existing methods [4].
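A minimal sketch of the counting step, assuming that "near" means the eight pixels surrounding each pixel and that the counts are normalized into probabilities. Both choices are assumptions for illustration, and the function name is hypothetical.

```python
def cooccurrence_matrix(image, levels=16, offsets=None):
    """Count how often gray level i has gray level j as a neighbor, then normalize."""
    if offsets is None:
        # Assumption: all 8 surrounding pixels count as neighbors.
        offsets = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                   if (dr, dc) != (0, 0)]
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            for dr, dc in offsets:
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:  # skip neighbors outside the image
                    counts[image[r][c]][image[nr][nc]] += 1
    total = sum(map(sum, counts))
    if total == 0:  # a single-pixel image has no neighbor pairs
        return [[0.0] * levels for _ in range(levels)]
    return [[n / total for n in row] for row in counts]
```

With 16 gray levels the result is a 16x16 matrix whose entries sum to 1. Because every neighbor pair is counted in both directions, the matrix is symmetric.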
Contrast Dissimilarity
Homogeneity Entropy
These feature measurement metrics are used to reduce the co-occurrence matrix down to values that can be measured or plotted against other images [4]. The accompanying graph details the correlation between Levenshtein Similarity and all possible combinations of the above feature metrics. The most correlated combination of metrics is Contrast, Entropy, and Dissimilarity with a strong 0.8986 correlation (1 is perfectly correlated).
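The four metrics can be sketched using their standard co-occurrence definitions, as given in texture tutorials such as [4]. The function below assumes a normalized matrix whose entries sum to 1. The function name and the dictionary it returns are illustrative choices, not the poster's code.

```python
import math

def texture_features(p):
    """Return contrast, dissimilarity, homogeneity, and entropy of a normalized matrix p.

    contrast      = sum of p[i][j] * (i - j)^2
    dissimilarity = sum of p[i][j] * |i - j|
    homogeneity   = sum of p[i][j] / (1 + (i - j)^2)
    entropy       = sum of -p[i][j] * ln(p[i][j])
    """
    contrast = dissimilarity = homogeneity = entropy = 0.0
    for i, row in enumerate(p):
        for j, value in enumerate(row):
            if value == 0:
                continue  # zero entries contribute nothing (0 * ln 0 is treated as 0)
            diff = i - j
            contrast += value * diff * diff
            dissimilarity += value * abs(diff)
            homogeneity += value / (1 + diff * diff)
            entropy -= value * math.log(value)
    return {
        "contrast": contrast,
        "dissimilarity": dissimilarity,
        "homogeneity": homogeneity,
        "entropy": entropy,
    }
```

The feature vector used for clustering would then be the triple (contrast, entropy, dissimilarity) taken from this result.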
[Figure: co-occurrence window moving over the image. Labels: Image; Pixel that is compared against all neighbors; Window; Window Position 1; Window Position 2]
[Figure: Finding LUSE Overview. Flowchart elements: User (Start/End); Query (Sequence); Feature Metric Calculation; Cluster with Similar Sequences. Steps: the user submits a query sequence of at least 101 base pairs; feature metrics are generated from the query; the metrics are plotted and clustered.]
These metrics can next be plotted in 3-dimensional space and clustered using the K Means algorithm. Because of the strong correlation, each cluster will group sequences that meet a measurable similarity threshold.
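A minimal sketch of the clustering step, using plain Lloyd's k-means in pure Python. Seeding the centroids with the first k points and the fixed iteration cap are simplifying assumptions for illustration. A production setup would more likely rely on a library implementation.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=100):
    """Plain Lloyd's k-means; the first k points seed the centroids (an assumption)."""
    if k > len(points):
        raise ValueError("k cannot exceed the number of points")
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(tuple(sum(axis) / len(members)
                                           for axis in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids
```

Each point here would be one (contrast, entropy, dissimilarity) feature vector, so sequences with similar texture features receive the same cluster label.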
[Figure: 3-dimensional plot of the clustered feature vectors. Visible axis labels: Contrast, Entropy]
Benefits to approach:
MapReduce works in parallel => very fast: Linear time vs. Exponential
Same time cost to compare 1 vs. 1 and 1 vs. all
Scalable to amount of hardware available: More nodes = Better Performance
Setup can handle entire genomes to be compared at once
Only need to run a sequence once; results will continue to be added as the database grows

Potential Uses:
Identify Ultra Conserved Elements (UCEs) [1]
Identify evolutionarily significant mutations
Potential for medical uses: disease diagnosis, genetic research, etc.
Others

What's Next:
Testing different clustering algorithms: Soft Clustering
Implement and test Spark
June Publication
Yellow area is calculated, blank pixels are not
[1] Reneker J, Lyons E, Conant GC, Pires JC, Freeling M, Shyu CR, Korkin D. Proc Natl Acad Sci U S A. 2012 May 8;109(19):E1183-91. doi: 10.1073/pnas.1121356109. Epub 2012 Apr 10.
[2] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," Commun. ACM, vol. 51, Jan. 2008, pp. 107-113.
[3] Hadoop, http://hadoop.apache.org/
[4] Co-Occurrence Matrix, http://www.fp.ucalgary.ca/mhallbey/texture_calculations.htm
[5] Apache Spark, http://spark.apache.org/
[6] Apache HBase, http://hbase.apache.org/
[7] Levenshtein, Vladimir I. (February 1966). "Binary codes capable of correcting deletions, insertions, and reversals". Soviet Physics Doklady 10 (8): 707-710.
MapReduce, Hadoop, Spark, & HBase – Big Data Ecosystem
Retrieved from: http://hadoop.apache.org
MapReduce Overview
Cluster Setup
10 Intel NUC computers
1 Master Node: 16GB RAM, Dual Core 2.0GHz CPU, 1TB Hard Disk Space, 480GB Solid State Drive
9 Compute Nodes: 8GB RAM, Dual Core 2.0GHz CPU, 1TB Hard Disk Space, 480GB Solid State Drive
Retrieved from: http://spark.apache.org
// Map Function 1
// input <k,v>: k is the offset of the current file block (in bytes); v is a sequence in chromosome C
// m is the length of v; n is the length of each subsequence
v = P(v)                          // remove invalid characters
for i = 0 to m-n do {
    FV = generateFV(v[i to i+n])  // generate feature vector
    start_pos = i + k
    emit (FV, (start_pos, C))     // one output pair per subsequence
}
// Reduce Function 1
// input <k,v>: k is the feature vector (FV); v is the starting position of the subsequence w.r.t. the chromosome sequence
pos = merge(v)
return (k, pos)
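The first map and reduce pass can be imitated in memory with a short Python sketch. This only illustrates the data flow. It is not Hadoop code, the function names are hypothetical, and `toy_fv` is a stand-in that counts bases rather than computing the texture feature vector.

```python
from collections import defaultdict

VALID = set("ACGT")

def map_subsequences(offset, sequence, chromosome, window, generate_fv):
    """Map step: emit (feature vector, (start position, chromosome)) per window."""
    # P(v): drop invalid characters before windowing.
    cleaned = "".join(b for b in sequence.upper() if b in VALID)
    for i in range(len(cleaned) - window + 1):
        yield generate_fv(cleaned[i:i + window]), (i + offset, chromosome)

def reduce_positions(pairs):
    """Shuffle and reduce step: merge positions that share a feature vector."""
    merged = defaultdict(list)
    for fv, position in pairs:
        merged[fv].append(position)
    return dict(merged)

def toy_fv(subsequence):
    """Stand-in for generateFV: base counts, NOT the poster's texture features."""
    return tuple(subsequence.count(b) for b in "ACGT")

pairs = list(map_subsequences(100, "ACNGTAC", "chr1", 2, toy_fv))
table = reduce_positions(pairs)
```

Because every window is mapped independently, the map step can be split across nodes, and only the grouping by feature vector requires data to move between them.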
// Map Function 2
// input <k,v>: k is the feature vector; v is the list of positions matching the feature vector
k = normalize(k)  // normalize data
return (k, v)
// Reduce Function 2
// input <k,v>: k is the normalized feature vector; v is the list of starting positions
cl = kmean(k)  // cluster data using k means
return (cl, v)
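The poster does not say how normalize is defined. One common choice, shown here purely as an assumption, is min-max scaling of each feature dimension to the range 0 to 1, so that a metric with a large numeric range does not dominate the distance calculation in k-means. The function name is hypothetical.

```python
def min_max_normalize(vectors):
    """Rescale every feature dimension to [0, 1] across all vectors.

    Assumed normalization; a dimension with no spread maps to 0.0.
    """
    lows = [min(axis) for axis in zip(*vectors)]
    highs = [max(axis) for axis in zip(*vectors)]
    return [
        tuple((x - lo) / (hi - lo) if hi > lo else 0.0
              for x, lo, hi in zip(vector, lows, highs))
        for vector in vectors
    ]
```

This version needs the minimum and maximum of each dimension over the whole data set, so in a distributed setting those statistics would have to be gathered before the second map pass runs.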
[Figure: MapReduce diagram. Labels: Original Data (Sequence); Master Node; Mapper 1 ... Mapper n, each producing <FV, (Ch ID, Pos)>; Reducer 1 ... Reducer n, each producing <FV, (List of Pos IDs)>; Output to HBase]
Retrieved from: http://hbase.apache.org
Co-occurrence Matrix FV calculated
Aggregate elements with matching FV
Identifying Long Ultra Similar Elements (LUSEs) in Genomes Using Image Based Texture Co-Occurrence Matrix
Devin Petersohn1 and Chi-Ren Shyu (Mentor)1,2
1Department of Computer Science, College of Engineering, 2MU Informatics Institute, University of Missouri
References
HBase Table Schema
Feature Vector: <Contrast, Entropy, Dissimilarity>
Table of ordered pairs (Ch ID, Pos): (Ch ID 1, Pos 1), (Ch ID 2, Pos 2), ..., (Ch ID n, Pos n)
K Mean Cluster ID (calculated 2nd iteration)
Shuffling
Acknowledgements
This project was sponsored by the MU College of Engineering Undergraduate Honors Research Program
Undergraduate Research Forum – Spring 2014
[Figure: Running Time for 1st MapReduce Function on a 6 Node Cluster. X axis: Number of Base Pairs (in Millions), 0 to 2,250. Y axis: Time (minutes), 0 to 200]