8/10/2019 RegExp.ppt
1/16
Regular Expressions
A simple and powerful way to match characters
Laurent Falquet, EPFL, March 2005
Swiss Institute of Bioinformatics, Swiss EMBnet node
Regular Expressions
What is a regular expression?
Literal (or normal) characters
  Alphanumeric: abcABC0123...
  Punctuation: -_ ,.;:=()/+ *%&{}[]?!$^|\"@#
Metacharacters
  Ex: ls *.java
Flavors
  awk, egrep, Emacs, grep, Perl, POSIX, Tcl, PROSITE!
Example: PROSITE patterns are regular expressions
Pattern:
Regular Expressions (1)
In Perl: //
Start and end of line
  ^ start, $ end
Match any of several
  [] or (|)
Match 0, 1 or more
  .      1 of any
  ?      0 or 1
  +      1 or more
  *      0 or more
  {m,n}  range
  !      negation (as in the !~ operator)
Examples
Match every instance of a SwissProt AC
  m/[OPQ][0-9][A-Z0-9]{3}[0-9]/;
  m/[OPQ]\d[A-Z0-9]{3}\d/;
Match every instance of a SwissProt ID
  m/[A-Z0-9]{2,5}_[A-Z0-9]{3,5}/;
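The two patterns can be tried on a short string. This is a minimal sketch: the sample text is made up, and the \b word boundaries are an addition that is not on the slide, used here so that a match cannot start or end in the middle of a longer word.

```perl
use strict;
use warnings;

# Hypothetical sample text; the accession numbers and IDs are only illustrations.
my $text = "sw:P12345 CARD_HUMAN and Q9Y2G1 (ASC_HUMAN)";

# SwissProt AC: one of O, P, Q, a digit, three alphanumerics, a digit.
# The \b anchors are an assumption added for this sketch.
my @acs = $text =~ /\b([OPQ]\d[A-Z0-9]{3}\d)\b/g;

# SwissProt ID: 2-5 alphanumerics, an underscore, 3-5 alphanumerics.
my @ids = $text =~ /\b([A-Z0-9]{2,5}_[A-Z0-9]{3,5})\b/g;

print "ACs: @acs\n";
print "IDs: @ids\n";
```

In list context with /g, the match returns every captured substring, which is why both arrays collect all instances in one statement.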
Regular Expressions (2)
Escape character or backreference
  \char or \num
Shorthand
  \d        digit [0-9]
  \s        whitespace [space\f\n\r\t]
  \w        word character [a-zA-Z0-9_]
  \D \S \W  complement of \d \s \w
Byte notation
  \num      character in octal
  \xnum     character in hexadecimal
  \cchar    control character
Match operator m//
  $var =~ m/colou?r/;
  $var !~ m/colou?r/;
Substitution operator s///
  $var =~ s/colou?r/couleur/;
Translate operator tr///
  $revcomp =~ tr/ACGT/tgca/;
Modifiers //#
  /i case insensitive
  /g global match
  Many others: /s, /m, /o, /x...
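The three operators and the /i and /g modifiers can be seen side by side in a short script. This is a sketch with made-up strings; note that tr/// only complements the bases, so the string is reversed first to obtain a true reverse complement, a step that is assumed here and is not shown on the slide.

```perl
use strict;
use warnings;

my $var = "The color of the colour";

# Match operator: the u is optional, so both spellings match.
my $matched = ($var =~ m/colou?r/) ? 1 : 0;

# Substitution with /g replaces every occurrence, not only the first.
(my $french = $var) =~ s/colou?r/couleur/g;

# tr/// maps characters one to one; it complements but does not reverse,
# so the string is reversed first (assumption of this sketch).
my $dna = "AACGT";
(my $revcomp = scalar reverse $dna) =~ tr/ACGT/tgca/;

# /i makes the match case-insensitive.
my $ci = ("COLOR" =~ /colou?r/i) ? 1 : 0;

print "$french\n$revcomp\n";
```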
Regular Expressions (3)
Grouping
External reference
  $var =~ s/sp\:(\w\d{5})/swissprot AC=$1/;
Internal reference
  $var =~ s/tr\:(\w\d{5})\|\1/trembl AC=$1/;
Numbering
  $1 to $9
  $10 and beyond if needed...
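The difference between the external reference ($1, used in the replacement) and the internal reference (\1, used inside the pattern) shows up when the repeated part does not match. A minimal sketch, with invented accession numbers:

```perl
use strict;
use warnings;

# Hypothetical input strings; the accession numbers are made up for illustration.
my $sp = "sp:P12345";
my $tr = "tr:Q98765|Q98765";

# External reference: $1 is used in the replacement part.
$sp =~ s/sp\:(\w\d{5})/swissprot AC=$1/;

# Internal reference: \1 inside the pattern must repeat what group 1 captured.
$tr =~ s/tr\:(\w\d{5})\|\1/trembl AC=$1/;

# No substitution happens here because the two accessions differ.
my $mismatch = "tr:Q98765|Q11111";
my $changed = ($mismatch =~ s/tr\:(\w\d{5})\|\1/trembl AC=$1/) ? 1 : 0;

print "$sp\n$tr\n";
```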
Exercises
  Create a regexp to recognize any pseudo IP address: 012.345.678.912
  Create a regexp to recognize any email address: [email protected]
  Create a regexp to change any HTML tag to another
  -> On sib-dea: use visual_regexp-1.2.tcl to check your regular expressions (requires X-windows)
Regular Expressions (4)
Solution RegExp
/(\d{1,3}\.){3}\d{1,3}/
/\w+\.\w+\@\w+\-?\w+\.[a-z]{2,4}/
/\/\/
generalized: address = \w+
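One possible way to check the three exercises is sketched below. Several things are assumptions: the ^ and $ anchors are added so that the whole string must match, the email address is invented, and the tag names in the substitution (address changed to pre) are chosen only for illustration.

```perl
use strict;
use warnings;

# Pseudo IP address: three groups of 1-3 digits followed by a dot, then 1-3 digits.
# The ^ and $ anchors are an assumption of this sketch.
my $ip_re = qr/^(?:\d{1,3}\.){3}\d{1,3}$/;

# Email in the firstname.lastname@host.tld shape that the pattern above expects.
my $mail_re = qr/^\w+\.\w+\@\w+\-?\w+\.[a-z]{2,4}$/;

my $ip_ok   = ("012.345.678.912" =~ $ip_re) ? 1 : 0;
my $ip_bad  = ("012.345.678"     =~ $ip_re) ? 1 : 0;
my $mail_ok = ('jane.doe@example.org' =~ $mail_re) ? 1 : 0;   # made-up address

# Change one HTML tag into another, keeping the optional closing slash.
# Both tag names are assumptions.
my $html = "<address>Lausanne</address>";
(my $out = $html) =~ s{<(/?)address>}{<${1}pre>}g;

print "$ip_ok $ip_bad $mail_ok $out\n";
```

Replacing the literal tag name with \w+ in the pattern would generalize the substitution to any tag.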
Perl In-liners
In-liners: some options
  -a  autosplit (only with -n or -p)
  -c  check syntax
  -d  debugger
  -e  pass script lines
  -h  help
  -i  direct editing of a file
  -n  loop without print
  -p  loop with print
  -v  version
Example:
  perl -e 'print qq(hello world\n);'
In-liners: -n and -p
perl -pe 's/\r/\n/g' file
is equivalent to:
open READ, "file";
while (<READ>) {
  s/\r/\n/g;
  print;
}
close(READ);
perl -i -pe 's/\r/\n/g' file
Warning: the -i option modifies the file directly
perl -ne is the same without the print
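The loop that -p wraps around the script can be run without touching any file by reading from an in-memory string. This is a sketch under that assumption; the content of $data is made up, and the output is collected in $printed rather than printed line by line.

```perl
use strict;
use warnings;

# In-memory stand-in for a file with carriage-return line ends (made-up content).
my $data = "line one\rline two\r";
open(my $in, '<', \$data) or die "cannot open in-memory file: $!";

my $printed = '';
while (my $line = <$in>) {
    $line =~ s/\r/\n/g;    # the code given to -e
    $printed .= $line;     # -p would print here; -n would skip this step
}
close($in);

print $printed;
```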
In-liners: -a (only with -n or -p)
perl -ane 'print @F, "\n";' file
is equivalent to:
open READ, "file";
while (<READ>) {
  @F = split(' ');
  print @F, "\n";
}
close(READ);
Example:
hits -b 'sw' -o pff2 prf:CARD | perl -ane 'print join("\t",reverse(@F)),"\n";'
In-liners: -a (only with -n or -p)
hits -b 'sw' -o pff2 prf:CARD
sw:ICEA_XENLA 1 90 prf:CARD 5 -1 18.553
sw:RIK2_MOUSE 435 513 prf:CARD 5 -11 15.058
sw:CARC_HUMAN 1 88 prf:CARD 6 -1 15.395
sw:NAL1_HUMAN 1380 1463 prf:CARD 7 -1 15.058
sw:ASC_HUMAN 113 195 prf:CARD 7 -2 15.374
sw:CAR8_HUMAN 347 430 prf:CARD 8 -1 18.343
sw:CARF_HUMAN 134 218 prf:CARD 9 -1 12.932
hits -b 'sw' -o pff2 prf:CARD | perl -ane 'print join("\t", reverse(@F)),"\n";'
18.553 -1 5 prf:CARD 90 1 sw:ICEA_XENLA
15.058 -11 5 prf:CARD 513 435 sw:RIK2_MOUSE
15.395 -1 6 prf:CARD 88 1 sw:CARC_HUMAN
15.058 -1 7 prf:CARD 1463 1380 sw:NAL1_HUMAN
15.374 -2 7 prf:CARD 195 113 sw:ASC_HUMAN
18.343 -1 8 prf:CARD 430 347 sw:CAR8_HUMAN
12.932 -1 9 prf:CARD 218 134 sw:CARF_HUMAN
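What -a does for each line can be reproduced in a standalone script. A minimal sketch using the first line of the hits output shown above; the variable $reversed is introduced here only so that the result can be inspected.

```perl
use strict;
use warnings;

# First line of the hits output shown above.
my $line = "sw:ICEA_XENLA 1 90 prf:CARD 5 -1 18.553\n";

# -a performs this split on every input line and stores the fields in @F.
my @F = split(' ', $line);

# Reverse the field order and join the fields with tabs.
my $reversed = join("\t", reverse(@F)) . "\n";
print $reversed;
```

Splitting on ' ' discards leading and trailing whitespace, so the newline does not end up attached to the last field.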
In-liners: examples
perl -e 'print int(rand(100)),"\n" for 1..100' | perl -e '$x{$_}=1 while <>; print sort {$a<=>$b} keys %x'
for($i=0;$i
In-liners: extract FASTA from SP
cat /db/proteome/ECOLI.dat | perl -ne 'if (/^ID +(\w+)/){print ">$1\n";} if(/^ /) {s/ //g; print}'
is equivalent to:
open (READ, "/db/proteome/ECOLI.dat"); # open file
while ($line=<READ>) { # read line by line until the end
  if ($line =~ /^ID +(\w+)/) { print ">$1\n"; } # print fasta header
  if ($line =~ /^ /) {
    $line =~ s/ //g; # remove spaces
    print $line; # print sequence line
  }
}
close(READ);
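The same two tests can be wrapped in a function and tried on a small string instead of the database file. This is a sketch: swissprot_to_fasta is a hypothetical helper name, and the entry is made up and heavily shortened.

```perl
use strict;
use warnings;

# Convert SwissProt-format text to FASTA with the two tests used above:
# ID lines become headers, lines starting with a space are sequence data.
# The function name is hypothetical.
sub swissprot_to_fasta {
    my ($text) = @_;
    my $fasta = '';
    for my $line (split /^/, $text) {       # split into lines, keeping newlines
        if ($line =~ /^ID +(\w+)/) { $fasta .= ">$1\n"; }
        if ($line =~ /^ /) {
            (my $seq = $line) =~ s/ //g;    # remove spaces
            $fasta .= $seq;
        }
    }
    return $fasta;
}

# Made-up, heavily shortened entry; a real entry has many more line types.
my $entry = join '',
    "ID   TEST_ECOLI     STANDARD;      PRT;    20 AA.\n",
    "SQ   SEQUENCE   20 AA;\n",
    "     MKTAYIAKQR QISFVKSHFS\n",
    "//\n";

print swissprot_to_fasta($entry);
```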
In-liners: your turn
Create an in-liner that extracts non-redundant FASTA format sequences from a redundant database in SwissProt format
cat /db/proteome/ECOLI.dat | perl -ne 'if (/^ID +(\w+)/){print ">$1\n";} if(/^ /) {s/ //g; print}' | perl -e 'while(<>) { if (/>/) { $i=$_; $x{$i}=""} $x{$i}.=$_} print values %x'
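The second stage of the pipeline, a hash keyed by the header line, can also be written as a small function. This sketch differs from the in-liner in one respect that is a design choice of the example: it remembers the order in which headers first appear, because values %x returns the records in no particular order. The name unique_fasta and the sample records are hypothetical.

```perl
use strict;
use warnings;

# Keep one copy of each FASTA record, keyed by header line.
# The function name is hypothetical.
sub unique_fasta {
    my ($fasta) = @_;
    my (%seq, @order, $header);
    for my $line (split /^/, $fasta) {
        if ($line =~ /^>/) {
            $header = $line;
            push @order, $header unless exists $seq{$header};
            $seq{$header} = '';        # a repeated header restarts its record
            next;
        }
        $seq{$header} .= $line if defined $header;
    }
    # Emit each header once, followed by its sequence lines.
    return join '', map { $_ . $seq{$_} } @order;
}

# Made-up records with one duplicate.
my $redundant = ">A_ECOLI\nMKT\n>B_ECOLI\nAYI\n>A_ECOLI\nMKT\n";
print unique_fasta($redundant);
```

Like the in-liner, this treats two records as redundant when their header lines are identical; it does not compare the sequences themselves.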