RegExp.ppt


  • 8/10/2019 RegExp.ppt

    Regular Expressions

    A simple and powerful way to match characters

    Laurent Falquet, EPFL, March 2005

    Swiss Institute of Bioinformatics, Swiss EMBnet node

    Regular Expressions

    What is a regular expression?

    Literals (or normal characters)
      Alphanumeric: abcABC0123...
      Punctuation: -_ ,.;:=()/+ *%&{}[]?!$^|\"@#

    Metacharacters
      Ex: ls *.java

    Flavors
      awk, egrep, Emacs, grep, Perl, POSIX, Tcl, PROSITE!

    Example: PROSITE patterns are regular expressions

    Pattern:
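To make the link between PROSITE syntax and Perl syntax concrete, here is a minimal sketch of a converter. The helper name `prosite_to_regex`, the sample pattern (written in PROSITE syntax, in the style of a C2H2 zinc-finger signature) and the test sequence are assumptions made for illustration; they are not taken from the slide.

```perl
use strict;
use warnings;

# Hypothetical helper: translate a PROSITE-style pattern into a Perl regexp.
sub prosite_to_regex {
    my ($pattern) = @_;
    my $re = $pattern;
    $re =~ s/\.$//;                        # drop the optional trailing period
    $re =~ s/-//g;                         # '-' only separates positions
    $re =~ s/\{([A-Z]+)\}/[^$1]/g;         # {ABC}  -> [^ABC]  (any residue except)
    $re =~ s/\((\d+(?:,\d+)?)\)/{$1}/g;    # x(2,4) -> x{2,4}  (repetition)
    $re =~ tr/x/./;                        # x      -> .       (any residue)
    $re =~ s/^</^/;                        # <      -> ^       (N-terminal anchor)
    $re =~ s/>$/\$/;                       # >      -> $       (C-terminal anchor)
    return $re;
}

# Illustrative pattern and sequence (assumptions, not from the slide).
my $regex = prosite_to_regex('C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H.');
my $seq   = 'MK' . 'C' . 'AA' . 'C' . 'AAA' . 'L' . ('A' x 8) . 'H' . 'AAA' . 'H' . 'K';

print "$regex\n";
print "match\n" if $seq =~ /$regex/;
```

The order of the substitutions matters: the `{...}` exclusion sets are rewritten before the `(n,m)` repeat counts, so the braces produced for the counts are never mistaken for exclusion sets.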

    Regular Expressions (1)

    In Perl, a pattern is written between slashes: //

    Start and end of line
      ^      start
      $      end

    Match any of several
      []  or  (|)

    Match 0, 1 or more
      .      1 of any
      ?      0 or 1
      +      1 or more
      *      0 or more
      {m,n}  range
      !      negation (as in the !~ operator; inside [], a leading ^ negates the class)

    Examples

    Match every instance of a SwissProt AC
      m/[OPQ][0-9][A-Z0-9]{3}[0-9]/;
      m/[OPQ]\d[A-Z0-9]{3}\d/;

    Match every instance of a SwissProt ID
      m/[A-Z0-9]{2,5}_[A-Z0-9]{3,5}/;
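The two example patterns can be tried on a short string. This is a minimal sketch: the sample text is invented for illustration, and the `\b` word boundaries and capturing parentheses are additions that are not on the slide (they keep a match from starting or ending in the middle of a longer word).

```perl
use strict;
use warnings;

# Sample text invented for illustration.
my $text = 'Hits: P12345 (CARC_HUMAN), Q9Y6K9 and O00187.';

# In list context, /g returns every captured match.
my @acs = $text =~ /\b([OPQ]\d[A-Z0-9]{3}\d)\b/g;
my @ids = $text =~ /\b([A-Z0-9]{2,5}_[A-Z0-9]{3,5})\b/g;

print "AC: @acs\n";   # AC: P12345 Q9Y6K9 O00187
print "ID: @ids\n";   # ID: CARC_HUMAN
```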

    Regular Expressions (2)

    Escape character or backreference
      \char or \num

    Shorthand
      \d        digit [0-9]
      \s        whitespace [space\f\n\r\t]
      \w        word character [a-zA-Z0-9_]
      \D \S \W  complement of \d \s \w

    Byte notation
      \num      character in octal
      \xnum     character in hexadecimal
      \cchar    control character

    Match operator m//
      $var =~ m/colou?r/;
      $var !~ m/colou?r/;

    Substitution operator s///
      $var =~ s/colou?r/couleur/;

    Translate operator tr///
      $revcomp =~ tr/ACGT/tgca/;

    Modifiers //#
      /i  case insensitive
      /g  global match
      Many others: /s, /m, /o, /x...
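The operators above can be exercised together in a few lines. This is a minimal sketch with invented sample strings; the variable names other than `$var` and `$revcomp` are assumptions. Note that `tr///` only complements the bases, so the sequence is reversed first to obtain a true reverse complement.

```perl
use strict;
use warnings;

my $var = 'What colour is it?';

# =~ tests for a match, !~ tests for the absence of a match.
my $matches = ($var =~ m/colou?r/) ? 'yes' : 'no';   # yes
my $negated = ($var !~ m/colou?r/) ? 'yes' : 'no';   # no

# s/// rewrites the first match; work on a copy to keep $var intact.
(my $french = $var) =~ s/colou?r/couleur/;           # What couleur is it?

# /g finds every match, /i ignores case; "() =" counts the matches.
my $count = () = 'color colour Color' =~ m/colou?r/gi;   # 3

# Reverse complement: reverse the string, then complement with tr///.
my $seq     = 'ATGCCG';
my $revcomp = scalar reverse $seq;
$revcomp =~ tr/ACGT/tgca/;                           # cggcat

print "$matches $negated\n$french\n$count\n$revcomp\n";
```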

    Regular Expressions (3)

    Grouping

      External reference
        $var =~ s/sp\:(\w\d{5})/swissprot AC=$1/;

      Internal reference
        $var =~ s/tr\:(\w\d{5})\|\1/trembl AC=$1/;

      Numbering
        $1 to $9
        $10 and beyond if needed...
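A short sketch of the difference between the two kinds of reference: `$1` is used outside the pattern (in the replacement), while `\1` is used inside the pattern to demand that the same text appears again. The input strings and the variable names `$external`, `$internal` and `$unchanged` are invented for illustration.

```perl
use strict;
use warnings;

# External reference: $1 carries the captured text into the replacement.
(my $external = 'sp:P12345') =~ s/sp\:(\w\d{5})/swissprot AC=$1/;

# Internal reference: \1 must repeat exactly what group 1 captured.
(my $internal = 'tr:Q12345|Q12345') =~ s/tr\:(\w\d{5})\|\1/trembl AC=$1/;

# The two accession numbers differ, so \1 fails and nothing is replaced.
(my $unchanged = 'tr:Q12345|Q99999') =~ s/tr\:(\w\d{5})\|\1/trembl AC=$1/;

print "$external\n";    # swissprot AC=P12345
print "$internal\n";    # trembl AC=Q12345
print "$unchanged\n";   # tr:Q12345|Q99999
```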

    Exercises

      Create a regexp to recognize any pseudo IP address: 012.345.678.912

      Create a regexp to recognize any email address: [email protected]

      Create a regexp to change any HTML tag to another

      -> On sib-dea: use visual_regexp-1.2.tcl to check your regular expressions (requires X-windows)

    Regular Expressions (4)

    Solution RegExp

    /(\d{1,3}\.){3}\d{1,3}/

    /\w+\.\w+\@\w+\-?\w+\.[a-z]{2,4}/

    /\/\/

    generalized: address = \w+
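The solutions can be checked with a few test strings. This is a hedged sketch: the IP pattern groups "digits plus dot" with parentheses (square brackets would define a character class instead), the `^` and `$` anchors are additions so that the whole string must match, the email address is a made-up example, and the tag substitution (`b` to `strong`) is only one possible answer to the third exercise, not necessarily the one intended on the slide.

```perl
use strict;
use warnings;

my $ip_re    = qr/^(\d{1,3}\.){3}\d{1,3}$/;
my $email_re = qr/^\w+\.\w+\@\w+\-?\w+\.[a-z]{2,4}$/;

my $ip_ok     = ('012.345.678.912' =~ $ip_re) ? 1 : 0;               # 1
my $ip_bad    = ('012.345.678' =~ $ip_re) ? 1 : 0;                   # 0
my $email_ok  = ('first.last@example-lab.org' =~ $email_re) ? 1 : 0; # 1
my $email_bad = ('nobody' =~ $email_re) ? 1 : 0;                     # 0

# One possible tag rewrite: the optional slash is captured and reused,
# so opening and closing tags are handled by the same substitution.
(my $html = '<b>bold</b>') =~ s{<(/?)b>}{<${1}strong>}g;

print "$ip_ok $ip_bad $email_ok $email_bad\n";   # 1 0 1 0
print "$html\n";                                 # <strong>bold</strong>
```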

    Perl In-liners

    In-liners: some options

    -a  autosplit (only with -n or -p)
    -c  check syntax
    -d  debugger
    -e  pass script lines
    -h  help
    -i  direct editing of a file
    -n  loop without print
    -p  loop with print
    -v  version

    Example:

    perl -e 'print qq(hello world\n);'

    In-liners: -n and -p

    perl -pe 's/\r/\n/g' file

    is equivalent to:

    open(READ, "file");
    while (<READ>) {
        s/\r/\n/g;
        print;
    }
    close(READ);

    perl -i -pe 's/\r/\n/g' file

    Warning: the -i option modifies the file directly

    perl -ne is the same without the print
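The loop that `-p` wraps around the script can be tried without touching any file by reading from an in-memory filehandle. This is a minimal sketch; the sample data and the variable names `$data`, `$in` and `$out` are assumptions made for illustration.

```perl
use strict;
use warnings;

# Sample text with carriage-return line endings (invented for illustration).
my $data = "line1\rline2\rline3\r";

# Open the string as if it were a file.
open(my $in, '<', \$data) or die "cannot open in-memory file: $!";

my $out = '';
while (<$in>) {        # the implicit loop added by -n and -p
    s/\r/\n/g;         # the script passed with -e
    $out .= $_;        # -p prints $_ here; with -n this step is left to the script
}
close($in);

print $out;
```

Because the sample contains no newline, the whole text arrives as a single "line", which is why the substitution needs the `/g` modifier to convert every carriage return.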

    In-liners: -a (only with -n or -p)

    perl -ane 'print @F, "\n";' file

    is equivalent to:

    open(READ, "file");
    while (<READ>) {
        @F = split(' ');
        print @F, "\n";
    }
    close(READ);

    Example:

    hits -b 'sw' -o pff2 prf:CARD | perl -ane 'print join("\t", reverse(@F)), "\n";'

    In-liners: -a (only with -n or -p)

    hits -b 'sw' -o pff2 prf:CARD

    sw:ICEA_XENLA 1 90 prf:CARD 5 -1 18.553
    sw:RIK2_MOUSE 435 513 prf:CARD 5 -11 15.058
    sw:CARC_HUMAN 1 88 prf:CARD 6 -1 15.395
    sw:NAL1_HUMAN 1380 1463 prf:CARD 7 -1 15.058
    sw:ASC_HUMAN 113 195 prf:CARD 7 -2 15.374
    sw:CAR8_HUMAN 347 430 prf:CARD 8 -1 18.343
    sw:CARF_HUMAN 134 218 prf:CARD 9 -1 12.932

    hits -b 'sw' -o pff2 prf:CARD | perl -ane 'print join("\t", reverse(@F)),"\n";'

    18.553 -1 5 prf:CARD 90 1 sw:ICEA_XENLA
    15.058 -11 5 prf:CARD 513 435 sw:RIK2_MOUSE
    15.395 -1 6 prf:CARD 88 1 sw:CARC_HUMAN
    15.058 -1 7 prf:CARD 1463 1380 sw:NAL1_HUMAN
    15.374 -2 7 prf:CARD 195 113 sw:ASC_HUMAN
    18.343 -1 8 prf:CARD 430 347 sw:CAR8_HUMAN
    12.932 -1 9 prf:CARD 218 134 sw:CARF_HUMAN
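What `-a` does for each line can be reproduced by hand on one row of the listing. This is a minimal sketch; `$line` and `$reversed` are names chosen for illustration.

```perl
use strict;
use warnings;

# One input line taken from the listing above.
my $line = "sw:ICEA_XENLA 1 90 prf:CARD 5 -1 18.553\n";

# -a fills @F with the whitespace-separated fields of the current line;
# split(' ') also discards the trailing newline.
my @F = split(' ', $line);

# Reverse the column order and rejoin with tabs.
my $reversed = join("\t", reverse(@F));

print scalar(@F), " fields\n";   # 7 fields
print "$reversed\n";
```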

    In-liners: examples

    perl -e 'print int(rand(100)),"\n" for 1..100' | perl -e '$x{$_}=1 while <>; print sort {$a<=>$b} keys %x'

    for($i=0;$i
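The idea behind the pipeline (use hash keys to remove duplicates, then sort numerically) can be sketched on a fixed list, so that the result is reproducible. The input values and the name `@unique` are invented for illustration; the one-liner uses random numbers instead.

```perl
use strict;
use warnings;

# Fixed sample values (the one-liner draws 100 random integers instead).
my @numbers = (42, 7, 42, 13, 7, 99);

# A hash keeps each key only once, which removes the duplicates.
my %x;
$x{$_} = 1 for @numbers;

# <=> compares numerically; the default sort would compare as strings.
my @unique = sort { $a <=> $b } keys %x;

print "@unique\n";   # 7 13 42 99
```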

    In-liners: extract FASTA from SP

    cat /db/proteome/ECOLI.dat | perl -ne 'if (/^ID +(\w+)/) {print ">$1\n";} if (/^ /) {s/ //g; print}'

    open(READ, "/db/proteome/ECOLI.dat");              # open file
    while ($line = <READ>) {                           # read line by line until the end
        if ($line =~ /^ID +(\w+)/) { print ">$1\n"; }  # print fasta header
        if ($line =~ /^ /) {
            $line =~ s/ //g;                           # remove spaces
            print $line;                               # print sequence line
        }
    }
    close(READ);
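The same logic can be wrapped in a function and tried on a small in-memory entry, without access to /db/proteome/ECOLI.dat. This is a minimal sketch: the function name `swissprot_to_fasta` and the sample entry (ID, AC and sequence) are made up for illustration and do not describe a real database record.

```perl
use strict;
use warnings;

# Hypothetical helper: convert SwissProt-style flat text to FASTA.
sub swissprot_to_fasta {
    my ($flat) = @_;
    my $fasta = '';
    for my $line (split /^/, $flat) {          # split into lines, keeping "\n"
        if ($line =~ /^ID +(\w+)/) {
            $fasta .= ">$1\n";                 # FASTA header from the ID line
        }
        if ($line =~ /^ /) {                   # sequence lines start with a space
            (my $seq = $line) =~ s/ //g;       # remove the spaces
            $fasta .= $seq;
        }
    }
    return $fasta;
}

# Made-up entry in SwissProt-like layout.
my $flat = join "\n",
    'ID   TEST_ECOLI     STANDARD;      PRT;    20 AA.',
    'AC   P00000;',
    'SQ   SEQUENCE   20 AA;',
    '     MKTAYIAKQR QISFVKSHFS',
    '//',
    '';

my $fasta = swissprot_to_fasta($flat);
print $fasta;
```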

    In-liners: your turn

    Create an In-liner that extracts non-redundant FASTA format sequences from a redundant database in SwissProt format

    cat /db/proteome/ECOLI.dat | perl -ne 'if (/^ID +(\w+)/) {print ">$1\n";} if (/^ /) {s/ //g; print}' | perl -e 'while (<>) { if (/>/) { $i=$_; $x{$i}="" } $x{$i}.=$_ } print values %x'
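The second stage of the pipeline, which keeps one record per FASTA header, can be sketched as a function and tested on a tiny input. The function name `nonredundant_fasta`, the `@order` array and the sample records are assumptions made for illustration. Unlike `print values %x`, whose output order is not defined, this variant remembers the order in which headers first appear.

```perl
use strict;
use warnings;

# Hypothetical helper: keep one record per FASTA header line.
sub nonredundant_fasta {
    my ($fasta) = @_;
    my (%x, @order, $i);
    for (split /^/, $fasta) {                  # split into lines, keeping "\n"
        if (/^>/) {
            $i = $_;                           # the header line is the hash key
            push @order, $i unless exists $x{$i};
            $x{$i} = '';                       # a repeated header restarts its record
        }
        $x{$i} .= $_ if defined $i;            # append header or sequence line
    }
    return join '', map { $x{$_} } @order;
}

# Made-up input with one duplicated record.
my $redundant = ">A\nMK\n>B\nTT\n>A\nMK\n";
my $nr = nonredundant_fasta($redundant);
print $nr;
```

Keying on the header line means that two entries count as redundant only when their IDs are identical; identical sequences stored under different IDs are both kept.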