
BIOINFORMATICS REVIEW - JANUARY 2016 ISSUE


DESCRIPTION

January 2016 issue of Bioinformatics Review. Available via http://bioinformaticsreview.com


JANUARY 2016 VOL 2 ISSUE 1

MUSCLE v/s T-COFFEE: An overview and different aspects

Genetic Algorithm: Explanation and Perl Code

“The greatest leap in bioinformatics is to predict secondary structure of protein” - Charles Wins


Contents

January 2016

Topics

Editorial

Programming

HTSeq: A Python framework to analyze high throughput sequencing data

Algorithms

Genetic Algorithm: Explanation and Perl Code

Tools

MUSCLE v/s T-COFFEE: An overview and different aspects

CADD

Active learning in drug-target interactions

CHIEF EDITOR

Dr. PRASHANT PANT

EDITORIAL

SECTION EDITORS

TARIQ ABDULLAH ALTAF ABDUL KALAM

MANISH KUMAR MISHRA SANJAY KUMAR PRAKASH JHA NABAJIT DAS

REPRINTS AND PERMISSIONS

You must have permission before reproducing any material from Bioinformatics Review. Send e-mail requests to [email protected]. Please include contact details in your message.

BACK ISSUE

Bioinformatics Review back issues can be downloaded in digital format from bioinformaticsreview.com at $5 per issue. Back issues in print format cost $2 for delivery in India and $11 for international delivery, subject to availability. Pre-payment is required.

CONTACT

PHONE +91. 991 1942-428 / 852 7572-667

MAIL Editorial: 101 FF Main Road Zakir Nagar, Okhla New Delhi IN 110025

STAFF ADDRESS To contact any Bioinformatics Review staff member, simply format the address as [email protected]

PUBLICATION INFORMATION

Volume 1, Number 1. Bioinformatics Review™ is published monthly for one year (12 issues) by the Social and Educational Welfare Association (SEWA) Trust (registered under the Trust Act 1882). Copyright 2015 Sewa Trust. All rights reserved. Bioinformatics Review is a trademark of Idea Quotient Labs and is used under licence by SEWA Trust. Published in India.

EXECUTIVE EDITOR

FOZAIL AHMAD

FOUNDING EDITOR

MUNIBA FAIZA

EDITORIAL: Welcoming BiR in its 2nd year

Bioinformatics, one of the most promising fields in terms of future prospects, lacks one thing: a news source. Although many journals publish a large volume of quality research on topics such as genome analysis, algorithms and sequence analysis, this work rarely gets any notice in the popular press.

One reason behind this rather disturbing trend is that very few people can successfully read a research paper and turn it into a news story. In addition, the bioinformatics community has not yet been introduced to research reporting. These factors are common to every relatively new (and rising) discipline such as bioinformatics.

Although there are a number of science reporting websites and portals, very few accept entries from their audience, which is expected to have expertise in one field or another.

Bioinformatics Review has been conceptualized to address all these concerns. We will provide insight into bioinformatics, both as an industry and as a research discipline, and we will post new developments and the latest research in the field.

We will also accept entries from our audience and, where possible, award them. To create an ecosystem of bioinformatics research reporting, we will engage all kinds of people involved in bioinformatics: students, professors, instructors and industries. We will also provide a free job listing service for anyone who can benefit from it.


Dr. Prashant Pant

Editor

Letters and responses:

[email protected]

Bioinformatics Review | 6

HTSeq: A Python framework to analyze high throughput sequencing data

Muniba Faiza

Image Credit: Google Images

“HTSeq is a Python library that makes it easy to develop the scripts required to carry out a particular task on HT data.”

High-throughput sequencing is widely used because it saves a lot of time and provides good results, but it produces a huge amount of data that is difficult to manage, and the tasks and operations performed on that data are difficult as well. To ease this, Simon Anders and team members introduced a Python framework known as “HTSeq”. HTSeq is a Python library that makes it easy to develop the scripts required to carry out a particular task on HT data. Basically, HTSeq reads files in various formats and parses them into recognized units for further analysis. It also provides classes for genomic coordinates, sequences, sequencing reads, alignments, gene model information, etc.

Two stand-alone applications have also been developed along with HTSeq: htseq-qa for read quality assessment and htseq-count for preprocessing RNA-Seq alignments for differential expression analysis.

HTSeq can read various formats such as FASTA, FASTQ (short reads) and SAM/BAM (short-read alignments). Wherever appropriate, different parsers yield the same type of record object. For example, the record class SequenceWithQualities is used whenever a sequencing read with base-call qualities needs to be represented; it is therefore yielded by the FastqParser class and is also present as a field in the SAM_Alignment objects yielded by SAM_Reader or BAM_Reader parser objects (Fig. 1). There are specific classes (GenomicPosition and GenomicInterval) to represent positions and intervals on the genome. To achieve good performance, various parts of HTSeq are written in Cython (a tool that translates Python code augmented with C type declarations into C).
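The idea that different parsers hand back the same record type can be illustrated with a minimal sketch in plain Python. It does not use HTSeq itself: `ReadWithQualities`, `parse_fastq` and `parse_sam_read` are hypothetical names invented for this illustration, and Phred+33 quality encoding is assumed.

```python
import io
from dataclasses import dataclass
from typing import Iterator, List, TextIO


@dataclass
class ReadWithQualities:
    """Hypothetical stand-in for a read record (not the HTSeq class)."""
    name: str
    seq: str
    qual: List[int]  # Phred quality scores, one per base


def decode_phred33(qualstr: str) -> List[int]:
    """Convert a Phred+33 encoded quality string into integer scores."""
    return [ord(c) - 33 for c in qualstr]


def parse_fastq(handle: TextIO) -> Iterator[ReadWithQualities]:
    """Yield one record per four-line FASTQ entry."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        seq = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line is ignored
        qual = handle.readline().rstrip("\n")
        yield ReadWithQualities(header[1:], seq, decode_phred33(qual))


def parse_sam_read(line: str) -> ReadWithQualities:
    """Extract the read part of one SAM alignment line.

    SAM columns 1, 10 and 11 hold the read name, sequence and qualities.
    """
    fields = line.rstrip("\n").split("\t")
    return ReadWithQualities(fields[0], fields[9], decode_phred33(fields[10]))
```

Because both functions return the same record type, downstream code (for example a quality filter) can be written once and applied to reads from either source.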


Fig. 1. (a) The SAM_Alignment class as an example of an HTSeq data record: subsets of the content are bundled in object-valued fields, using classes (here SequenceWithQualities and GenomicInterval) that are also used in other data records to provide a common view on diverse data types. (b) The cigar field in a SAM_Alignment object presents the detailed structure of a read alignment as a list of CigarOperation objects. This allows for convenient downstream processing of complicated alignment structures, such as the one given by the cigar string on top and illustrated in the middle. Five CigarOperation objects, with slots for the columns of the table (bottom), provide the data from the cigar string, along with the inferred coordinates of the affected regions in the read (‘query’) and the reference.
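How a cigar string can be unpacked into a list of operations with inferred coordinates is sketched below in plain Python. This is an illustration under stated assumptions, not HTSeq code: `CigarOp` and `parse_cigar` are hypothetical names, the example cigar string is made up, and coordinates are taken to be 0-based and half-open.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class CigarOp:
    """Hypothetical record for one cigar operation (not the HTSeq class)."""
    type: str          # operation code, e.g. 'M', 'I', 'D', 'N'
    size: int          # length of the operation
    ref_start: int     # affected region on the reference [start, end)
    ref_end: int
    query_start: int   # affected region on the read [start, end)
    query_end: int


# Which operations advance along the read and along the reference,
# following the SAM specification.
CONSUMES_QUERY = set("MIS=X")
CONSUMES_REF = set("MDN=X")


def parse_cigar(cigar: str, ref_pos: int) -> List[CigarOp]:
    """Split a cigar string and track read and reference coordinates."""
    ops = []
    query_pos = 0
    for size_str, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        size = int(size_str)
        q_len = size if op in CONSUMES_QUERY else 0
        r_len = size if op in CONSUMES_REF else 0
        ops.append(CigarOp(op, size, ref_pos, ref_pos + r_len,
                           query_pos, query_pos + q_len))
        ref_pos += r_len
        query_pos += q_len
    return ops


# A made-up spliced alignment with an insertion, starting at reference
# position 1000.
for op in parse_cigar("20M6I10M200N5M", 1000):
    print(op)
```

An insertion occupies bases in the read but none on the reference, so its reference interval is empty, while a skipped region (N) does the opposite.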

HTSeq also provides a class, SAM_Alignment, that deals with gapped alignments, multiple alignments and paired-end data. HTSeq provides a function, pair_SAM_alignments_with_buffer, to pair up the alignment records by keeping a buffer of reads whose mate has not yet been found, and so facilitates processing data at the level of sequenced fragments rather than reads. HTSeq also facilitates the storage of genome-position-dependent data, which means that each base-pair position on the genome can be given a particular value that can be easily stored and later retrieved by specifying the same position.
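The buffering idea behind mate pairing can be sketched in a few lines of plain Python. This is not the HTSeq function: `pair_with_buffer` is a hypothetical name, and a read is reduced to a `(name, mate_number)` tuple for the sake of the example.

```python
from typing import Dict, Iterable, Iterator, Optional, Tuple

Read = Tuple[str, int]  # hypothetical minimal record: (read name, mate 1 or 2)
Pair = Tuple[Optional[Read], Optional[Read]]


def pair_with_buffer(reads: Iterable[Read]) -> Iterator[Pair]:
    """Yield (mate1, mate2) tuples from reads in arbitrary order.

    Reads whose mate has not been seen yet wait in a buffer keyed by
    read name; as soon as the mate arrives, the pair is emitted.
    """
    waiting: Dict[str, Read] = {}
    for read in reads:
        name, _mate = read
        partner = waiting.pop(name, None)
        if partner is None:
            waiting[name] = read          # mate not seen yet: buffer it
        elif partner[1] == 1:
            yield partner, read           # buffered read was mate 1
        else:
            yield read, partner           # buffered read was mate 2
    # Reads whose mate never showed up are emitted with an empty slot.
    for read in waiting.values():
        yield (read, None) if read[1] == 1 else (None, read)


stream = [("a", 1), ("b", 2), ("a", 2), ("c", 1), ("b", 1)]
for pair in pair_with_buffer(stream):
    print(pair)
```

The buffer only holds reads that are still waiting, so memory use depends on how far apart mates are in the file rather than on the file size.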

The script htseq-qa is a simple tool for initial quality assessment of sequencing runs. It produces plots that summarize the nucleotide composition at each position in the read and the base-call qualities. As mentioned earlier in this article, htseq-count is a tool for RNA-Seq data analysis: for each gene, it counts how many aligned reads overlap the gene's exons. Since it is designed specifically for differential expression analysis, only reads mapping unambiguously to a single gene are counted, and reads aligned to multiple positions or overlapping more than one gene are discarded. In the case of paired-end data, htseq-count counts fragments rather than reads, because the two mates originating from the same fragment provide evidence for only one cDNA fragment and should hence be counted only once.
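The counting rule described above can be sketched in plain Python as follows. This is a simplified illustration, not the htseq-count implementation: the function names, the interval layout and the `__ambiguous` / `__no_feature` bucket labels used here are assumptions made for the example, and strandedness and multi-mapping are ignored.

```python
from collections import Counter
from typing import Dict, List, Sequence, Set, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), half-open


def genes_overlapping(read: Interval,
                      exons: Dict[str, List[Interval]]) -> Set[str]:
    """Return the genes having at least one exon that overlaps the read."""
    chrom, start, end = read
    return {gene
            for gene, intervals in exons.items()
            for (c, s, e) in intervals
            if c == chrom and start < e and s < end}


def count_fragments(fragments: Sequence[Sequence[Interval]],
                    exons: Dict[str, List[Interval]]) -> Counter:
    """Count each fragment (one or two mates) at most once per gene."""
    counts: Counter = Counter()
    for mates in fragments:
        hits: Set[str] = set()
        for mate in mates:
            hits |= genes_overlapping(mate, exons)
        if len(hits) == 1:
            counts[hits.pop()] += 1       # unambiguous: count the fragment
        elif hits:
            counts["__ambiguous"] += 1    # overlaps several genes: discard
        else:
            counts["__no_feature"] += 1   # overlaps no exon
    return counts


exons = {"geneA": [("chr1", 100, 200)], "geneB": [("chr1", 180, 300)]}
fragments = [
    [("chr1", 110, 150), ("chr1", 140, 175)],  # both mates in geneA
    [("chr1", 185, 195)],                      # overlaps geneA and geneB
    [("chr1", 250, 280)],                      # geneB only
    [("chr2", 10, 50)],                        # no annotated exon
]
print(count_fragments(fragments, exons))
```

The first fragment has two mates in geneA but contributes a single count, which is the fragment-level behaviour the paragraph describes.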

In this way, HTSeq offers a comprehensive solution to facilitate a wide range of programming tasks in HTS data analysis.

Note:

An exhaustive list of references for this article is available from the author on personal request; for more details write to [email protected]


Genetic Algorithm: Explanation and Perl Code

Tariq Abdullah

Image Credit: Stock Photos

“Genetic Algorithm was developed by John Holland. It uses the concepts of Natural Selection and Genetic Inheritance and tries to mimic biological evolution. It falls under the category of algorithms known as Evolutionary Algorithms.”

When it comes to bioinformatics algorithms, genetic algorithms top the list of the most used and most talked about. Understanding the genetic algorithm is important not only because it can help you reduce the computational time taken to get a result but also because it is inspired by how nature works. In this article, you will learn how the genetic algorithm works and the basic concept behind it, and we will also write a program to illustrate the concepts. You can skip the explanation if you already know the basic concepts of the Genetic Algorithm.

Genetic Algorithm was developed by John Holland. It uses the concepts of Natural Selection and Genetic Inheritance and tries to mimic biological evolution. It falls under the category of algorithms known as Evolutionary Algorithms. It can be used to find solutions to hard problems where we don’t know much about the search space.

Let us understand how the genetic algorithm works. For this, let us consider a cancer-associated gene expression matrix. This matrix contains all the known genes found in human beings and their levels of expression.

For a given problem, the genetic algorithm works by maintaining a set of candidate solutions and applying three operators to them: Selection, Recombination and Mutation, which are collectively known as stochastic operators.

Selection: In nature, if an organism is adapted to its environment, its population grows in proportion to the quality of its adaptation. This is referred to as selection. In a GA, it means that if a solution meets the constraints, it is replicated at a rate proportional to its relative quality.

Recombination: In nature, two similar chromosomes of the surviving individuals exchange genes during sexual reproduction in a process known as Crossing Over. In a GA, we decompose two distinct solutions and randomly mix their parts to form novel solutions.

Mutation: Random changes in an existing chromosome may lead to a fitter individual. This concept is used to randomly perturb a candidate solution.

1. produce an initial population of individuals

2. evaluate the fitness of all individuals

3. while termination condition not met do

4. select fitter individuals for reproduction

5. recombine between individuals

6. mutate individuals

7. evaluate the fitness of the modified individuals

8. generate a new population

9. end while
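The nine steps above can be sketched as a short Python function, given here alongside the article's own Perl listing rather than in place of it. The design choices are assumptions made for this illustration and are not taken from the article: a chromosome is a list of `k` distinct positions in the data, selection is fitness-proportional, recombination is one-point crossover followed by a repair step, the best individual is always carried over, and termination is a fixed number of generations. The data values are assumed to be non-negative, there must be more values than `k`, and `pop_size` is assumed to be even.

```python
import random
from typing import List, Sequence, Tuple


def run_ga(values: Sequence[float], k: int = 5, pop_size: int = 30,
           generations: int = 60, mutation_rate: float = 0.2,
           seed: int = 0) -> Tuple[List[int], float]:
    """Search for k distinct positions in `values` with a large sum."""
    rng = random.Random(seed)
    n = len(values)

    def fitness(chrom: List[int]) -> float:
        return sum(values[i] for i in chrom)

    def repair(chrom: List[int]) -> List[int]:
        # Drop duplicate positions created by crossover, then refill.
        unique = list(dict.fromkeys(chrom))
        unused = [i for i in range(n) if i not in unique]
        return unique + rng.sample(unused, k - len(unique))

    # Steps 1-2: initial population and its best individual.
    population = [rng.sample(range(n), k) for _ in range(pop_size)]
    best = max(population, key=fitness)

    for _ in range(generations):                      # step 3
        # Step 4: fitness-proportional selection (+1 avoids zero weights).
        weights = [fitness(c) + 1 for c in population]
        parents = rng.choices(population, weights=weights, k=pop_size)
        offspring = []
        for a, b in zip(parents[0::2], parents[1::2]):
            cut = rng.randrange(1, k)                 # step 5: crossover
            for child in (a[:cut] + b[cut:], b[:cut] + a[cut:]):
                child = repair(child)
                if rng.random() < mutation_rate:      # step 6: mutation
                    unused = [i for i in range(n) if i not in child]
                    child[rng.randrange(k)] = rng.choice(unused)
                offspring.append(child)
        # Steps 7-8: evaluate, keep the best so far, form new population.
        best = max(offspring + [best], key=fitness)
        population = offspring[:-1] + [best]
    return best, fitness(best)
```

Calling `run_ga(data)` on a list of expression values returns the chosen positions and their sum. Because the search is stochastic, the result is a good candidate rather than a guaranteed optimum; for this particular toy problem, simply sorting the values would of course be faster.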

Have a look at the Genetic Algorithm illustrated in the diagram below to understand it more clearly.

The program

We are going to implement the Genetic Algorithm in a Perl program. Although the program is not directly applicable to a real-life problem, it should be sufficient to familiarize you with the Genetic Algorithm.

Suppose that you have a set of gene expression data for all 25000 genes in the human genome, and you want to find the five values among all 25000 whose sum is the highest.

For the purpose of this program we will require four subroutines:

Generate: It generates chromosomes containing 5 values (specified in the variable $GeneNumberConstraint) taken from random positions in the data.

Mutate: It mutates a chromosome at a random position with a random value less than that specified in $HighestMutationValue.

Survival Check: It checks whether the newly formed chromosome is viable, i.e. whether its value meets a minimum specification (checking for fitness).

Recombine: It forms new combinations from existing chromosomes by crossing them over with each other.

The Code

If you wish, you can download the Perl code on GitHub: https://github.com/bioinformaticsreview/geneticalgorithm

So here is the final code implementing the Genetic Algorithm in Perl:

#!/usr/bin/perl
use strict;
use warnings;

# Toy gene expression values (one per gene).
my @GeneExpressionData = (1,3,8,5,2,4,46,6,78,7,9,9,0,1,1,1,5,59,9,97,7,6,5,45,4,3,23,2,22,2,2,4,5,5,6,54);

my $CurrentHighest       = 0;    # best chromosome sum found so far
my @SolutionSpace        = ();   # surviving chromosomes, stored as array references
my $HighestMutationValue = 110;  # a mutated gene takes a random value below this
my $GeneNumberConstraint = 5;    # number of values per chromosome
my $InitialThreshold     = 10;   # kept from the original listing; not used
my $genes                = scalar @GeneExpressionData;
my @chromosome           = ();   # chromosome currently being evaluated
my $steps                = 10;

print "The Total Genes are: $genes\n";

for (my $p = 0; $p < $steps; $p++) {
    generate();
    SurvivalCheck();
    mutate();
    SurvivalCheck();
    recombine();
    SurvivalCheck();
}

print "\n\n Genetic Algorithm Result\n\n\n\t\tHighest Detected: $CurrentHighest in $steps Steps\n\n";

# Replace one randomly chosen gene of the current chromosome with a random value.
sub mutate {
    my $randpos = int(rand(scalar @chromosome));
    my $n = int(rand($HighestMutationValue));
    $chromosome[$randpos] = $n;
    print "\n Mutation Took Place in Chromosome @chromosome ";
}

# Cross over two chromosomes picked at random from the solution space.
sub recombine {
    print "\nRecombining\n\n";
    return unless @SolutionSpace;    # nothing has survived yet
    my @chromosome1 = @{ $SolutionSpace[int(rand(scalar @SolutionSpace))] };
    my @chromosome2 = @{ $SolutionSpace[int(rand(scalar @SolutionSpace))] };
    print "Random Sequence Chromosome from Solution Space: @chromosome1 and @chromosome2";
    for (my $i = 0; $i <= $GeneNumberConstraint / 2; $i++) {
        my $pos1 = int(rand($GeneNumberConstraint));
        my $pos2 = int(rand($GeneNumberConstraint));
        # Swap one gene between the two chromosomes.
        ($chromosome1[$pos1], $chromosome2[$pos2]) = ($chromosome2[$pos2], $chromosome1[$pos1]);
    }
    @chromosome = @chromosome1;
    print "The Recombination led to @chromosome";
}

# Keep the current chromosome if its sum beats the best sum so far.
sub SurvivalCheck {
    my $sum = 0;
    $sum += $_ foreach @chromosome;
    if ($sum > $CurrentHighest) {
        $CurrentHighest = $sum;
        push @SolutionSpace, [@chromosome];    # store a copy
        print "\nIndividual is alive! \nCurrent Highest Expression: $CurrentHighest";
        return 1;
    }
    else {
        print "\nSpecies Didn't Survive! \n";
        return 0;
    }
}

# Build a new chromosome from randomly chosen expression values.
sub generate {
    @chromosome = ();
    for (my $i = 1; $i <= $GeneNumberConstraint; $i++) {
        my $n = int(rand($genes));
        push @chromosome, $GeneExpressionData[$n];
    }
    print "\n\n\nGenerated Chromosome: @chromosome \n";
}

That's all! Feel free to comment and discuss if you have any confusion.


MUSCLE v/s T-COFFEE: An overview and different aspects

Muniba Faiza

Image Credit: Google Images

“MUSCLE and T-COFFEE are both multiple sequence alignment tools that also help to study the evolutionary relationships among species.”

As I have discussed the multiple sequence alignment (MSA) tools MUSCLE and T-COFFEE in my earlier articles, in this article we will discuss different aspects of these tools and which one is preferred over the other. MUSCLE and T-COFFEE are both multiple sequence alignment tools that also help to study the evolutionary relationships among species. I have already explained the algorithms involved in both tools, which are comparable. During alignment, MUSCLE uses the UPGMA tree construction method, which assumes that mutations occur at a constant rate. This is one feature that distinguishes it from other tools.
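To make the UPGMA idea concrete, here is a minimal Python sketch of the clustering procedure on a made-up distance matrix. It is a textbook version written for illustration, not the code MUSCLE runs: the function name `upgma`, the species labels and the distances are all hypothetical, and branch lengths are omitted so that only the tree topology is returned.

```python
from typing import Dict, List, Sequence, Tuple


def upgma(names: Sequence[str], dist: Sequence[Sequence[float]]) -> str:
    """Build a UPGMA tree and return its topology in Newick-like form."""
    n = len(names)
    # cluster id -> (subtree written as text, number of leaves)
    clusters: Dict[int, Tuple[str, int]] = {
        i: (name, 1) for i, name in enumerate(names)
    }
    # pairwise distances, keyed by (smaller id, larger id)
    d: Dict[Tuple[int, int], float] = {
        (i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)
    }
    next_id = n
    while len(clusters) > 1:
        i, j = min(d, key=d.get)          # closest pair of clusters
        del d[(i, j)]
        (tree_i, size_i) = clusters.pop(i)
        (tree_j, size_j) = clusters.pop(j)
        # Distance from the merged cluster to every other cluster is the
        # size-weighted average of the two old distances.
        for k in clusters:
            d_ik = d.pop((min(i, k), max(i, k)))
            d_jk = d.pop((min(j, k), max(j, k)))
            d[(k, next_id)] = (size_i * d_ik + size_j * d_jk) / (size_i + size_j)
        clusters[next_id] = (f"({tree_i},{tree_j})", size_i + size_j)
        next_id += 1
    (tree, _size), = clusters.values()
    return tree


names = ["A", "B", "C", "D"]
dist = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
print(upgma(names, dist))  # (D,(C,(A,B)))
```

Because every merge places the new node at equal distance from all leaves below it, the resulting tree is ultrametric, which is where the constant-rate assumption mentioned above comes from.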

On the positive side, MUSCLE is known for its speed and accuracy on each of the four benchmark test sets (BAliBASE, SABmark, SMART and PREFAB). It is much faster than other MSA tools. MUSCLE also uses progressive alignment, which is iterated for as long as it yields a better SP score (explained in the “Basic concept of MSA” article).
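As a reminder of what the SP (sum-of-pairs) score measures, the following Python sketch scores an alignment column by column over all pairs of sequences. The scoring values (match +1, mismatch -1, gap -2, gap against gap ignored) are arbitrary assumptions chosen for the example; real tools use substitution matrices and more elaborate gap penalties.

```python
from itertools import combinations
from typing import Sequence


def sp_score(alignment: Sequence[str], match: int = 1,
             mismatch: int = -1, gap: int = -2) -> int:
    """Sum-of-pairs score of a multiple alignment (rows of equal length)."""
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            if a == "-" and b == "-":
                continue                  # gap against gap is not scored
            if a == "-" or b == "-":
                total += gap
            elif a == b:
                total += match
            else:
                total += mismatch
    return total


alignment = [
    "AC-T",
    "ACGT",
    "A-GT",
]
print(sp_score(alignment))  # 0
```

In the example, the two fully conserved columns contribute +3 each and the two columns containing a gap contribute -3 each, giving a total of 0. An iterative refinement step would keep a modified alignment only if this number goes up.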

T-COFFEE is an improvement over MUSCLE in the sense that it combines both global and local alignments, which provides better results, and it also performs well on the four benchmark tests. A second feature that makes it better than other tools is that it uses an optimization method to find the multiple alignment that best fits the input library. T-COFFEE also uses a progressive alignment strategy similar to MUSCLE but, unlike MUSCLE, it uses the Neighbor Joining tree construction method during alignment, which corrects the assumption of the UPGMA method because it does not assume that mutations occur at a constant rate.

Let us take protein sequences of the ‘Keratin’ protein from a few species, align them using both tools and construct the respective phylogenetic trees. In this example, I have taken FASTA sequences of Homo sapiens (GI: 7717238), Paralichthys olivaceus (GI: 10716084), Pseudomonas viridiflava (GI: 934022154) and Pseudomonas aeruginosa (GI: 856785229). The results are as follows:

As we can see, the two trees are slightly different. In the tree constructed by MUSCLE, the sequence of Paralichthys olivaceus is placed below that of Homo sapiens, but it is placed above it in the tree constructed by T-COFFEE. The same is the case with the other two species. This is how MUSCLE and T-COFFEE differ from each other.


T-COFFEE is preferred over MUSCLE for aligning both closely and distantly related species because it uses both global and local alignment, whereas MUSCLE, which uses global alignment only, is more suitable for aligning distantly related species.

Note:

An exhaustive list of references for this article is available from the author on personal request; for more details write to [email protected]

Fig 1. Tree constructed using MUSCLE.

Fig 2. Tree constructed using T-COFFEE.


Active learning in drug-target interactions

Muniba Faiza

Image Credit: Google Images

“Active learning is a powerful tool for drug discovery and development where it reduces the tedious process of performing a number of experiments which are required to produce significant high-confidence predictions.”

Active learning is a kind of machine learning. In active learning, a learning algorithm is used to select the experiments to be performed in order to produce a desired output.

Active learning is a powerful tool for drug discovery and development where it reduces the tedious process of performing a number of experiments which are required to produce significant high-confidence predictions. In practice, however, it is difficult to decide when to stop the experimentation process. Applying a reliable stopping criterion to the algorithm therefore reduces both the time and the cost of the experimentation process.

The basis of active learning is having good predictive models to guide experimentation.

Active learning iteratively builds a model for drug-target interactions. Instead of relying on a large, manually assembled training data set, the active learning procedure increases the training set stepwise. Thus, time and experimental cost are reduced and are spent only on improving the model, rather than on verifying a specific model that may not even produce the desired outcome or suit the specifications under consideration.

How does active learning work?

Active learning is an iterative process and is completed in four steps:

1. Initialization

2. Model

3. Active learning algorithm

4. Accuracy measure of the predicted output

The active learning strategy starts with an initialization step in which an interaction matrix for drugs and targets is formed. Along with this matrix, a subset of known labels and the drug and target kernels, Kd and Kt respectively, are provided.


The model predicts the drug-target interactions. Based on the obtained predictions, the active learning algorithm is applied to find new experiments (labels) that will improve the model according to the requirements. Here, batchwise learning is applied: a fixed number of experiments is queried in each training round, thereby increasing the number of experiments (labels) available to the model.

Each training round corresponds to a specific time point, which is measured by the number of experiments performed. At each time point, the accuracy of the model is estimated using various methods. The process is stopped when some condition is met, for example when a certain budget for performing experiments is reached or when the predicted accuracy of the model is high enough.
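The four steps and the two stopping conditions can be put together in a small Python sketch. Everything specific in it is an assumption made for illustration and is not the method discussed in the article: the model is a crude neighbour average over already labelled pairs in the same row or column (the kernels Kd and Kt are not used), new experiments are chosen by uncertainty sampling, accuracy is estimated from how well the model predicted the batch it just queried, and the `truth` matrix plays the role of the wet-lab experiments.

```python
import random
from typing import Dict, List, Sequence, Tuple

Pair = Tuple[int, int]  # (drug index, target index)


def predict(pair: Pair, labels: Dict[Pair, int]) -> float:
    """Crude hypothetical model: mean label of known pairs that share
    the same drug or the same target; 0.5 when nothing is known."""
    d, t = pair
    known = [y for (i, j), y in labels.items() if i == d or j == t]
    return sum(known) / len(known) if known else 0.5


def active_learning(truth: Sequence[Sequence[int]], batch_size: int = 4,
                    budget: int = 20, target_accuracy: float = 0.9,
                    seed: int = 0):
    rng = random.Random(seed)
    pairs = [(d, t) for d in range(len(truth)) for t in range(len(truth[0]))]
    # 1. Initialization: a small random subset of known labels.
    labels = {p: truth[p[0]][p[1]] for p in rng.sample(pairs, batch_size)}
    history: List[Tuple[int, float]] = []
    while len(labels) < budget:                        # budget criterion
        unlabeled = [p for p in pairs if p not in labels]
        if not unlabeled:
            break
        # 2. Model: predict every interaction that is still unknown.
        scores = {p: predict(p, labels) for p in unlabeled}
        # 3. Active learning algorithm: query the most uncertain pairs.
        size = min(batch_size, budget - len(labels))
        batch = sorted(unlabeled, key=lambda p: abs(scores[p] - 0.5))[:size]
        # 4. Accuracy measure: compare predictions with the new results.
        correct = 0
        for p in batch:
            outcome = truth[p[0]][p[1]]                # "run the experiment"
            correct += int((scores[p] >= 0.5) == bool(outcome))
            labels[p] = outcome
        accuracy = correct / len(batch)
        history.append((len(labels), accuracy))        # one time point
        if accuracy >= target_accuracy:                # accuracy criterion
            break
    return labels, history
```

Each entry of `history` is one time point: the number of experiments performed so far and the accuracy estimated in that round. Scoring the model on the pairs it was least sure about gives a pessimistic estimate, so a stricter or looser `target_accuracy` may be appropriate depending on how accuracy is measured.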

This is the basic idea of active learning applied to drug-target predictions. It saves a lot of the time and cost involved in performing experiments in vitro.

Note:

An exhaustive list of references for this article is available from the author on personal request; for more details write to [email protected]


Subscribe to Bioinformatics Review newsletter to get the latest post in your mailbox and

never miss out on any of your favorite topics.

Log on to

www.bioinformaticsreview.com
