
Title: An Evolutionary Algorithm Approach for Feature Generation from Sequence Data and Its Application to DNA Splice Site Prediction.

Authors: Uday Kamath, Kenneth A. De Jong, Amarda Shehu, Jack Compton, and Rezarta Islamaj-Dogan.

Source: IEEE/ACM Transactions on Computational Biology and Bioinformatics, Vol. 9, No. 5,

September/October 2012.

Speaker: Nguyen Dinh Chien (阮庭戰), Student ID: P0126557.

Sequence-based classification aims to discover signals or features hidden in sequence data that carry predictive information: features that correlate with the sought property and discriminate between sequences that contain the property and those that do not. Reduction techniques, such as information gain, chi-square, mutual information, and KL-distance, are additionally employed to further reduce the size of the feature set. It is therefore important to propose feature generation methods that are not limited by biological insight, the considered type of feature, or the ability to enumerate features.

A eukaryotic DNA sequence yields mature mRNA only after enzymes splice away noncoding regions (introns) from the pre-mRNA sequence, leaving only coding regions (exons); prediction of splice sites is therefore a fundamental component of the gene-finding problem.

Splice site prediction is a difficult problem: the AG and GT (GU) dinucleotides by themselves cannot be used as features because of their abundance in non-splice-site sequences.

Many studies have demonstrated the advantages of EAs for feature generation in different domains, such as fast genetic selection and nearest-neighbor classification, among others. All of these methods, which obtain predictive features from sequence data, have shown success in diverse bioinformatics problems. The common structure of an evolutionary algorithm (EA) is shown in the following figure.
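To make that structure concrete, here is a minimal, generic EA loop in Python. The truncation selection, the 0.9 crossover rate, and the helper names (make_random_individual, fitness, mutate, crossover) are illustrative placeholders, not the paper's settings.

```python
import random

# A minimal, generic EA loop: initialize a population, then repeatedly
# evaluate, select, and vary it until the generation budget is spent.
def evolve(pop_size, generations, make_random_individual, fitness, mutate, crossover):
    population = [make_random_individual() for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: max(2, pop_size // 2)]        # keep the fitter half as parents
        offspring = []
        while len(offspring) < pop_size:
            a, b = random.sample(parents, 2)
            child = crossover(a, b) if random.random() < 0.9 else mutate(a)
            offspring.append(child)
        population = offspring
    return max(population, key=fitness)
```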

In this study, the authors explored the use of evolutionary algorithms to search a large and complex feature space. They obtained features from sequence data that can significantly improve the classification accuracy of a Support Vector Machine (SVM). This approach is called FG-EA, short for Feature Generation with an Evolutionary Algorithm, and it uses Genetic Programming (GP) techniques to evolve the kind of structures illustrated in the following figure.


Comparing with state-of-the-art feature-based classification methods, they found that FG-EA features significantly improve classification performance. The FG-EA algorithm generates complex features, represented internally as GP trees, and evaluates them on splice site training data using a surrogate fitness function.

- The features in the hall of fame transform input sequence data into feature vectors.
- An SVM operating over the feature vectors then allows evaluating the accuracy of the resulting classifier.

This is the pipeline they employ to predict DNA splice sites. The top features obtained after exploring the feature space with FG-EA transform input sequences into feature vectors on which an SVM classifier can then operate.
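A hedged sketch of that pipeline is shown below: hall-of-fame features turn raw sequences into (here binary) feature vectors, and an SVM is trained on them. The helper `evaluate_feature(f, seq)`, which tests one GP feature tree on one sequence, is an assumption for illustration, not part of the authors' published interface.

```python
import numpy as np
from sklearn.svm import SVC

def to_feature_vector(sequence, hall_of_fame, evaluate_feature):
    # One column per hall-of-fame feature; 1.0 if the feature fires on the sequence.
    return np.array([evaluate_feature(f, sequence) for f in hall_of_fame], dtype=float)

def train_splice_classifier(sequences, labels, hall_of_fame, evaluate_feature):
    X = np.vstack([to_feature_vector(s, hall_of_fame, evaluate_feature)
                   for s in sequences])
    clf = SVC(kernel="rbf")     # the kernel and its parameters are tuned in a later step
    clf.fit(X, labels)
    return clf
```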

The flow diagram shows the main steps of the FG-EA algorithm. Features/individuals are evolved until a maximum number Gen_Max of generations has been reached. The mutation and crossover operators detailed below are employed to obtain new features in a generation. The top features of a generation are contributed to a growing hall of fame, which in turn contributes randomly selected features to seed the next generation.

The example tree represents the feature "GTT with length=3 in position 30 AND GTT with length=3 in position 36." Using an efficient fitness function, they identified a set of candidate features (a hall of fame) to be used as input to a standard SVM classification procedure.
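To make the example concrete, here is a hedged sketch of how such a conjunctive positional feature could be evaluated on a sequence. The tuple encoding (AND/OR nodes over ("match", motif, position) leaves) and the evaluator are illustrative assumptions, not the paper's internal representation.

```python
def matches_at(sequence, motif, position):
    """Positional feature: does `motif` occur starting at `position` (0-based)?"""
    return sequence[position:position + len(motif)] == motif

def evaluate_feature(tree, sequence):
    op = tree[0]
    if op == "match":
        _, motif, position = tree
        return matches_at(sequence, motif, position)
    if op == "AND":
        return evaluate_feature(tree[1], sequence) and evaluate_feature(tree[2], sequence)
    if op == "OR":
        return evaluate_feature(tree[1], sequence) or evaluate_feature(tree[2], sequence)
    raise ValueError(f"unknown node type: {op}")

# The feature quoted above: GTT at position 30 AND GTT at position 36.
example_feature = ("AND", ("match", "GTT", 30), ("match", "GTT", 36))
```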


In this paper, the authors propose several types of features: compositional and positional features, correlational features, and conjunctive and disjunctive features.

The random initial population, generation 0, consists of N = 15,000 features. The trees representing features are generated using the well-known ramped half-and-half generative method, which includes both the Full and Grow techniques. These techniques yield a mixture of fully balanced trees and bushy trees, with each technique employed with an equal probability of 0.5. Subsequent generations are evolved using standard GP (Genetic Programming) selection, crossover, and mutation mechanisms.
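A minimal sketch of ramped half-and-half initialization over the same tuple-encoded trees is given below. The motif list, position range, depth ramp, and the early-stop probability in grow() are illustrative choices, not the paper's configuration.

```python
import random

OPERATORS = ["AND", "OR"]
MOTIFS = ["A", "C", "G", "T", "AG", "GT", "GTT"]

def random_leaf(rng):
    return ("match", rng.choice(MOTIFS), rng.randrange(0, 162))

def full(depth, rng):
    """Full method: every branch grows to the maximum depth."""
    if depth == 0:
        return random_leaf(rng)
    return (rng.choice(OPERATORS), full(depth - 1, rng), full(depth - 1, rng))

def grow(depth, rng):
    """Grow method: branches may stop early, giving irregular, bushier trees."""
    if depth == 0 or rng.random() < 0.3:
        return random_leaf(rng)
    return (rng.choice(OPERATORS), grow(depth - 1, rng), grow(depth - 1, rng))

def ramped_half_and_half(n, max_depth, rng=random):
    """Half the trees come from Full and half from Grow, over a ramp of depths."""
    population = []
    for i in range(n):
        depth = 2 + i % (max_depth - 1)          # ramp depths from 2 up to max_depth
        method = full if rng.random() < 0.5 else grow
        population.append(method(depth, rng))
    return population

initial_features = ramped_half_and_half(n=15000, max_depth=6)   # N = 15,000 as in the study
```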

Given a set of m features extracted from the hall of fame to serve as parents in a generation, the remaining GenSize − m features are generated using the mutation and crossover operators. They employed three breeding pipelines: mutation-ERC, mutation, and crossover.
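The sketch below shows subtree crossover and mutation over the same tuple-encoded trees. Real GP systems (and presumably the mutation-ERC pipeline) add typing and depth constraints that are omitted here for brevity; this is an illustration of the operators, not the authors' implementation.

```python
import random

def subtrees(tree, path=()):
    """Yield (path, subtree) pairs; recurse only through AND/OR nodes."""
    yield path, tree
    if isinstance(tree, tuple) and tree[0] in ("AND", "OR"):
        for i, child in enumerate(tree[1:], start=1):
            yield from subtrees(child, path + (i,))

def replace_at(tree, path, new_subtree):
    """Return a copy of `tree` with the subtree at `path` replaced."""
    if not path:
        return new_subtree
    i = path[0]
    return tree[:i] + (replace_at(tree[i], path[1:], new_subtree),) + tree[i + 1:]

def crossover(parent_a, parent_b, rng=random):
    """Swap randomly chosen subtrees between the two parents."""
    path_a, sub_a = rng.choice(list(subtrees(parent_a)))
    path_b, sub_b = rng.choice(list(subtrees(parent_b)))
    return replace_at(parent_a, path_a, sub_b), replace_at(parent_b, path_b, sub_a)

def mutate(tree, new_random_subtree, rng=random):
    """Replace one randomly chosen subtree with a freshly generated one."""
    path, _ = rng.choice(list(subtrees(tree)))
    return replace_at(tree, path, new_random_subtree())
```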

The fitness function is key to achieving an efficient and effective EA search heuristic. FG-EA uses a

surrogate fitness function given by:

Fitness(f) = (C_{+,f} / C_+) · IG(f)

f – feature; the ratio C_{+,f}/C_+ is weighted by the information gain IG(f).

Given m class attributes:

IG(f) = −Σ_{i=1..m} P(c_i)·log P(c_i) + P(f)·Σ_{i=1..m} P(c_i|f)·log P(c_i|f) + P(f̄)·Σ_{i=1..m} P(c_i|f̄)·log P(c_i|f̄)

where c_1, …, c_m are the class attributes and f̄ denotes the absence of feature f.
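A hedged sketch of this surrogate fitness for the binary (positive/negative) case follows. `positives` and `negatives` are the positive and negative training sequences, and `has_feature(f, s)` is an assumed predicate testing whether feature f fires on sequence s; the log base and empty-set handling are my own choices.

```python
import math

def information_gain(f, positives, negatives, has_feature):
    n_pos, n_neg = len(positives), len(negatives)
    n = n_pos + n_neg
    pos_f = sum(has_feature(f, s) for s in positives)      # count of positives containing f
    neg_f = sum(has_feature(f, s) for s in negatives)
    p_f = (pos_f + neg_f) / n                               # P(f)

    def plogp_sum(counts, total):
        # sum of p*log(p) over the class distribution given by `counts`
        return sum((c / total) * math.log(c / total) for c in counts if c and total)

    h_class = -plogp_sum([n_pos, n_neg], n)                            # -sum P(c_i) log P(c_i)
    given_f = plogp_sum([pos_f, neg_f], pos_f + neg_f)                 # sum P(c_i|f) log P(c_i|f)
    given_not_f = plogp_sum([n_pos - pos_f, n_neg - neg_f], n - pos_f - neg_f)
    return h_class + p_f * given_f + (1 - p_f) * given_not_f

def surrogate_fitness(f, positives, negatives, has_feature):
    c_plus_f = sum(has_feature(f, s) for s in positives)    # C_{+,f}
    return (c_plus_f / len(positives)) * information_gain(f, positives, negatives, has_feature)
```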

The ℓ fittest individuals of a generation are added to a hall of fame, which keeps the fittest individuals of each generation. Maintaining a hall of fame guarantees that fit individuals will not be lost or changed. They use the hall of fame for two reasons: to maintain diversity in the solution space and to guarantee optimal performance.

In this study, the ℓ = 250 fittest individuals of each generation are added to the hall of fame, and a generation seeds its population with m = 100 randomly chosen individuals from the current set of features in the hall of fame.
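A minimal sketch of this hall-of-fame bookkeeping, using the reported values ℓ = 250 and m = 100, is shown below; the data structures are illustrative.

```python
import random

ELL_FITTEST = 250    # fittest individuals of each generation added to the hall of fame
SEED_COUNT = 100     # hall-of-fame features used to seed the next generation

def update_hall_of_fame(hall_of_fame, generation, fitness):
    best = sorted(generation, key=fitness, reverse=True)[:ELL_FITTEST]
    hall_of_fame.extend(best)        # the hall of fame only ever grows
    return hall_of_fame

def seed_next_generation(hall_of_fame, rng=random):
    k = min(SEED_COUNT, len(hall_of_fame))
    return rng.sample(hall_of_fame, k)
```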

The set of features in the hall of fame can be further narrowed through Recursive Feature Elimination (RFE). RFE starts with a large feature set and gradually reduces it by removing the least successful features until a stopping criterion is met. They employed RFE to estimate the impact of feature set size on the precision and accuracy of the classification, and to compare directly with existing work.
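One way to realize this narrowing is scikit-learn's RFE wrapped around a linear SVM, whose per-feature weights provide the ranking; the step size and target size below are placeholders, not the paper's values.

```python
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

def narrow_feature_set(X, y, n_features_to_keep, step=50):
    # Repeatedly drop the `step` lowest-weighted features until the target size is reached.
    selector = RFE(SVC(kernel="linear"),
                   n_features_to_select=n_features_to_keep,
                   step=step)
    selector.fit(X, y)
    return selector.support_        # boolean mask over the original feature columns
```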

SVMs are popular and successful in a wide variety of binary classification problems. First, they map the sequence data into a Euclidean vector space. Second, they select a kernel function to map that vector space into a higher-dimensional and more effective Euclidean space. Finally, they tune the kernel and other SVM parameters to improve performance.
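A hedged sketch of the final tuning step, using a grid search over kernel and SVM parameters, is given below; the grid, scoring metric, and cross-validation setup are my choices for illustration, not the values used by the authors.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    "kernel": ["rbf", "poly"],
    "C": [0.1, 1, 10],
    "gamma": ["scale", 0.01, 0.1],
}
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="average_precision")
# search.fit(X_train, y_train)        # X_train, y_train: feature vectors and labels (assumed)
# best_svm = search.best_estimator_
```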

They compare classification performance with two different groups of state-of-the-art methods in splice site prediction: feature-based and kernel-based. For the feature-based comparison (FGA and GeneSplicer), data were extracted from the 2005 NCBI RefSeq collection of 5057 human pre-mRNA sequences (http://www.ncbi.nlm.nih.gov), yielding 51008 positive (splice-site-containing) and 200000 negative sequences; the 25504 acceptor and 25504 donor sequences consist of 162 nucleotides each (80 upstream + AG|GT + 80 downstream). For the kernel-based comparison (WD and WDS), data were extracted from the worm data set with EST sequences (http://www.wormbase.org); this group contains 64844 donor and 64838 acceptor splice site sequences, each 142 nucleotides long (60 + AG|GT + 80).
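The sketch below shows how fixed-length candidate windows could be cut around AG/GT dinucleotides, matching the 162-nt human windows described above (80 upstream + AG|GT + 80 downstream). This is illustrative preprocessing, not the authors' extraction pipeline.

```python
def candidate_windows(sequence, dinucleotide, upstream=80, downstream=80):
    # Collect every window of length upstream + 2 + downstream centered on the dinucleotide.
    windows = []
    for i in range(upstream, len(sequence) - downstream - 1):
        if sequence[i:i + 2] == dinucleotide:
            windows.append(sequence[i - upstream:i + 2 + downstream])
    return windows

# donor_candidates = candidate_windows(pre_mrna, "GT")     # pre_mrna: a sequence string (assumed)
# acceptor_candidates = candidate_windows(pre_mrna, "AG")
```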


Two sets of classification experiments are conducted: one compares the performance of FG-EA with FGA and GeneSplicer, and the other compares it with the WD and WDS methods. The values of the parameters can be found on the website http://www.cs.gmu.edu/~ashehu/?q=OurTools. They measure performance in terms of 11ptAVG, FPR, auROC, and auPRC. For example, the following figure plots FPR over recall (left: acceptor, right: donor) for the B2Hum testing data set.
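The curve-based metrics can be computed with scikit-learn as sketched below; y_true and y_score are assumed test labels and SVM decision values, and 11ptAVG would instead require the 11-point interpolated precision-recall curve.

```python
from sklearn.metrics import roc_auc_score, average_precision_score

def curve_metrics(y_true, y_score):
    # auROC: area under the ROC curve; auPRC: area under the precision-recall curve.
    return {
        "auROC": roc_auc_score(y_true, y_score),
        "auPRC": average_precision_score(y_true, y_score),
    }
```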

In the two following tables, the authors compare auROC and auPRC values on 40K sequences sampled (left-hand table) and on 10 different sets of 360K sequences sampled (right-hand table) from the Worm data sets.

They divide the hall of fame features into three types of subsets: the first subset contains all compositional features; the second contains all region-specific compositional, positional, and correlational features; and the third contains all remaining features, including the conjunctive and disjunctive features. The right-hand table shows the IG (information gain) sums of these feature subsets evaluated over the acceptor and donor data.

FG-EA outperforms state-of-the-art feature generation methods in splice site classification and reveals the significant role of novel complex conjunctive and disjunctive features. The proposed FG-EA algorithm can easily be employed in other prediction problems on biological sequences. Further extensions of FG-EA could combine the evolution of features with the evolution of SVM kernels for greater classification accuracy. The authors also plan to employ regular expressions to further combine the expressions and reduce their bloat, thereby improving readability and performance.