PlantMiRNAPred: efficient classification of real and pseudo plant pre-miRNAs
杜嘉晨, 2015.4.1


Outline

- Introduction
- Method
- Results and Discussion
- Conclusion

Introduction

MicroRNAs (miRNAs) are non-coding RNAs that play important roles in gene regulation through mRNA cleavage or translational repression. Our main task is to develop an efficient classifier that distinguishes real plant pre-miRNAs from pseudo ones.

Method

- Features of plant pre-miRNAs
- Feature selection
- Training sample selection
- Classification based on SVM

Features of plant pre-miRNAs

- Dinucleotide frequencies %XY with X, Y ∈ {A, C, G, U} (16 features) plus %G+C (1 feature): 17 features in total
- Thermodynamic and stability features, e.g. minimum free energy (MFE): 31 features in total
- Structural characteristics: structure triplet composition (32) plus structure triplet composition from the stem (32): 64 features in total

After feature extraction there are 112 (17 + 31 + 64) features in total used for classification. (Minimal sketches of these feature computations appear after the feature selection slides below.)

Structural characteristics

- Use '(' to indicate a paired nucleotide and '.' to indicate an unpaired nucleotide.
- Considering the middle nucleotide among three adjacent nucleotides, there are 32 (4 × 8) possible structure-sequence combinations.
- Count the frequency of each combination.

Preprocessing of the secondary structure:
(a) Replace big loops with an Ls sequence.
(b) Replace big bulges with an Ls sequence.
(c) Cut off unmatched sequences.

Feature subset selection

- Information gain (IG) measures a feature's discrimination ability.
- Feature similarity (Sim) measures how alike two features are. It ranges from 0 to 1: 0 indicates that the two features are irrelevant to each other, 1 indicates that they are identical.

Treat every feature as a node in a graph. Whenever the Sim measurement between two features exceeds a threshold (0.49), add an edge between the two corresponding nodes, weighted by that Sim value.

[Figure: the novel graph-based feature subset selection method, illustrated with four features a-d, their IG values, the pairwise Sim values, and the resulting feature graph]

Each node's feature selection weight (FWS) is calculated from its IG and the Sim weights of its incident edges. Select the node with the highest FWS and remove all nodes adjacent to it; repeat this procedure until only isolated nodes remain in the graph. The features corresponding to the remaining nodes form our ideal feature subset (a sketch of this greedy procedure follows below).

Within each category we select the features with high IG:
a) Primary features
b) Energy and thermodynamic features
c) Secondary-structure features

Finally, 68 features remain after feature selection.
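As a concrete illustration of the 17 sequence-composition features above, here is a minimal Python sketch; the function name and layout are ours, not from the paper.

```python
from itertools import product

def composition_features(seq):
    """16 dinucleotide frequencies (%XY, X,Y in {A,C,G,U}) plus %G+C."""
    seq = seq.upper().replace("T", "U")
    dinucs = ["".join(p) for p in product("ACGU", repeat=2)]
    counts = dict.fromkeys(dinucs, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
    total = max(len(seq) - 1, 1)
    feats = {f"%{d}": counts[d] / total for d in dinucs}
    feats["%G+C"] = (seq.count("G") + seq.count("C")) / max(len(seq), 1)
    return feats  # 17 values
```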
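The structure triplet composition can be counted directly from a dot-bracket string (obtained, e.g., from an RNA folding tool); a minimal sketch, with ')' collapsed to '(' so each position is simply paired or unpaired:

```python
from itertools import product

def triplet_features(seq, structure):
    """32-dim structure triplet composition: middle nucleotide (4 choices)
    times the paired/unpaired pattern of 3 adjacent positions (2^3 = 8)."""
    seq = seq.upper().replace("T", "U")
    s = structure.replace(")", "(")          # only paired vs. unpaired matters
    keys = [n + "".join(t) for n in "ACGU" for t in product("(.", repeat=3)]
    counts = dict.fromkeys(keys, 0)
    for i in range(1, len(seq) - 1):         # i is the middle nucleotide
        key = seq[i] + s[i - 1:i + 2]
        if key in counts:
            counts[key] += 1
    total = max(sum(counts.values()), 1)
    return {k: v / total for k, v in counts.items()}
```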
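A sketch of the graph-based subset selection described above. The slides do not give the exact FWS formula, so the one used here (IG penalized by the mean similarity to current neighbors) is only an illustrative assumption:

```python
def select_features(ig, sim, threshold=0.49):
    """ig: {feature: information gain};
    sim: {(f1, f2): similarity in [0, 1]}, each unordered pair listed once."""
    # Build the feature graph: an edge wherever Sim exceeds the threshold.
    adj = {f: {} for f in ig}
    for (a, b), s in sim.items():
        if s > threshold:
            adj[a][b] = adj[b][a] = s

    nodes = set(ig)
    while True:
        connected = [f for f in nodes if any(n in nodes for n in adj[f])]
        if not connected:                    # only isolated nodes remain
            break

        def fws(f):                          # assumed FWS, see lead-in above
            nbrs = [n for n in adj[f] if n in nodes]
            return ig[f] - sum(adj[f][n] for n in nbrs) / len(nbrs)

        best = max(connected, key=fws)
        for n in list(adj[best]):            # drop every neighbor of the winner
            nodes.discard(n)
    return nodes                             # surviving nodes = feature subset
```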
Training sample selection

- Representation of samples
- Distance between two samples
- Degree of coverage: the degree of coverage of a sample s in a certain area of feature space is the number of samples in that area whose nearest neighbor is s.

Assume the i-th family contains N_i samples and v_k is the feature vector of the k-th sample. The central point of the family is then calculated as

    c_i = (1/N_i) * Σ_{k=1..N_i} v_k

Suppose the selection rate of the sample space is 1/n; that is, N_i/n samples are selected from the i-th family, and the number of selected samples is denoted P_i = N_i/n. The distance between the k-th sample (a real pre-miRNA) v_k and the central point c_i is

    d(v_k, c_i) = sqrt( (v_k - c_i)^T (v_k - c_i) )

where ^T denotes the transpose. The radius of the i-th family is r_i = max d(v_k, c_i), 1 ≤ k ≤ N_i.

With c_i as the center, draw two circles with radii 0 and (1/P_i)·r_i; the region between them is denoted A_0. For each sample s in A_0, compute the degree of coverage C(s), i.e. the number of samples in A_0 whose nearest neighbor is s. The sample with the greatest C(s) is selected as a training sample. Using (1/P_i)·r_i as the step length, we then compute the degree of coverage in each ring A_k bounded by the circles of radii (1/P_i)·k·r_i and (1/P_i)·(k+1)·r_i (1 ≤ k ≤ P_i − 1), and in each A_k select the sample with the largest degree of coverage. (A sketch of this ring-based procedure follows the results below.)

Results and Discussion

Data before sample selection:
- Positive data: 128 miRNA families (1612 pre-miRNAs) plus 431 other miRNAs, total 1906
- Negative data: 17 groups of pseudo pre-miRNAs (2122), total 2122

Data after sample selection:
- Positive data: 980
- Negative data: 980

Feature sets compared in the experiments:
- 68 features: the subset selected by our algorithm
- 80 features: without the structural features from the stem
- 51 features: without any structural features
- 115 features: the whole feature set
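A minimal NumPy sketch of the ring-based training-sample selection for a single family; the function name and details are illustrative, not from the paper:

```python
import numpy as np

def select_training_samples(vectors, n):
    """Pick about N_i/n representative samples from one family by degree of
    coverage, following the ring construction described above."""
    v = np.asarray(vectors, dtype=float)
    N = len(v)
    P = max(N // n, 1)                        # P_i = N_i / n samples to keep
    c = v.mean(axis=0)                        # central point c_i
    d = np.linalg.norm(v - c, axis=1)         # distances d(v_k, c_i)
    r = d.max()                               # family radius r_i

    selected = []
    for k in range(P):                        # rings A_0 .. A_{P-1}
        lo, hi = k * r / P, (k + 1) * r / P
        ring = np.where((d >= lo) & (d <= hi))[0]
        if ring.size == 0:
            continue
        # C(s): number of ring samples whose nearest neighbor is s.
        cover = {i: 0 for i in ring}
        for i in ring:
            others = ring[ring != i]
            if others.size:
                nn = others[np.argmin(np.linalg.norm(v[others] - v[i], axis=1))]
                cover[nn] += 1
        selected.append(max(cover, key=cover.get))
    return selected                           # indices of the chosen samples
```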
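Finally, classification is done with an SVM on the selected features. The slides do not specify the SVM implementation or parameters, so this scikit-learn sketch with an RBF kernel is only an assumed setup for the balanced 980 + 980 training set:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def train_classifier(X, y):
    """X: (1960, 68) matrix of selected features for the 980 real and 980
    pseudo pre-miRNAs; y: 1 for real, 0 for pseudo."""
    # Scaling and the RBF parameters are illustrative defaults, not the
    # settings reported in the paper.
    clf = make_pipeline(StandardScaler(),
                        SVC(kernel="rbf", C=1.0, gamma="scale"))
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"5-fold CV accuracy: {acc:.3f}")
    return clf.fit(X, y)
```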