next up previous
Next: Using HMMer to find Up: Regular Expressions, Profiles and Previous: UNIX Regular Expressions

Regular expressions on sequences

You can use UNIX regular expressions on sequences too. The only trick is that you want to match entire sequences as strings and return the lines that match them, but sequences in multifasta files are broken on multiple lines. Also, their identifiers are on a different line, so matching the sequence doesn't get you the name of the sequence that matches. It is a simple matter to fix these problems, though. There are many programs to match sequences with regular expressions. One is called subfasta. You will practice using regular expressions to solve routine bioinformatic tasks on the E. coli genome, and learn a little more about the advantages and disadvantages of motifs.

  1. Download subfasta and protein sequences from the E. coli genome.

  2. Also Download the annotated protein descriptions from the E. coli genome.

  3. Look at the genome proteins fasta file.

    % less NC_000913.ffp

    press 'q' to quit.

  4. Pull out all the sequence description lines into less.

    % grep \> NC_000913.ffp | less

    OBS: very important that you quote '\' the angle bracket '>' from the shell!
    with the backslash '\'. Otherwise bad things can happen!
    e.g. on't do this: grep > NC_000913.ffp or you'll erase the file and have to download it again!

  5. Find all E. coli proteins that match the P-loop motif from the lecture: [GA]x(4)GK[ST] using subfasta. Try to do this from the help:

    % subfasta -h

    Hint: '.' matches any character and the {N} quantifier matches any number of times.
    First print to screen a set of sequences, then save to file ploop.fas.

  6. Count the matches. First take a look with

    % grep \> ploop.fas | less

    Type 'q' to quit out of less, space to page down. Then count the number of matches with;

    % grep \> ploop.fas | wc

    which number is it? if you don't know, try:

    % man wc

  7. Extract the unique IDs. Now we are going to use the unique ids from the ploop.fas to get more information about the proteins. The unique IDs are the strings like b0001 and are only unique for the E. coli genome. We'll cut this part out and reshape it to get all the matching lines in the protein description file. From this we'll try to ask how many ATP/GTP proteins the match missed (false negatives), and also how many non-ATP/GTP proteins the motif matched (false positives).
    First look at that file with

    % less NC_00913.ptt

    Do you see the unique IDs with 'b' there? Notice also that each description is on one line, one line per protein. This kind of tab-delimited format, one entry per line in simple text is perfect for quick analysis with UNIX tools. It's critical for what's ahead in this exercise. Press 'q' to quit.

    Now try to cut out the ids from the matching sequence file in ploop.fas:
    % grep \> ploop.fas | cut -f2 -d' ' | less
    type 'q' to quit out of less.
    Now you'll paste all those ids together with a little perl trick that you don't have to understand. With this you will make a regular expression to use with grep on the ptt file
    % grep \> ploop.fas | cut -f2 -d' ' | perl -ane 's/\n/|/; print if (/b/);'
    Hint: just cut and paste the above line from your web-page to the terminal.
    You should get back a very long unbroken line of identifiers starting with 'b' and separated by pipes '|'.
  8. Get descriptions of all proteins that matched the motif. Cut and paste the output of the last command (all the b00xx's separated by |s, and leave off the last |), call it xxx, and   make it an argument to egrep:
    % egrep 'xxx' NC_000913.ptt | less
    Make sure you use the quotes '!
  9. Count the matches and
  10. Save to the file ploop.ptt. Use the tools you used above and the up-arrow key to bring the last command back.
  11. Find and count all descriptions in NC_000913.ptt that match the ``ATP'' or ``GTP.''
  12. Save those to file AGTP.ptt.
  13. Count how many descriptions are in the combined files.
    % cat ploop.ptt AGTP.ptt | wc
    Now we'll see how many proteins have ATP or GTP in the description that didn't match the motif. We'll do this by sorting the concatenation of the two files so that duplicate lines are next to each other. Try it, you 'll see what I mean
    % cat ploop.ptt AGTP.ptt | sort | less
    Did you see duplicates? Now you can remove duplicates and count them with:
    % cat ploop.ptt AGTP.ptt | sort | uniq | wc
    The excess over the number in motif.ptt is the number of proteins that had ATP or GTP in the description but didn't match the motif. What is this number?
    To find proteins that matched the motif but don't contain ATP or GTP in the description, use the negation option -v to grep on motif.ptt. What is this number?

next up previous
Next: Using HMMer to find Up: Regular Expressions, Profiles and Previous: UNIX Regular Expressions
David Ardell 2005-01-31