Next: Using HMMer to find
Up: Regular Expressions, Profiles and
Previous: UNIX Regular Expressions
You can use UNIX regular expressions on sequences too. The only trick is
that you want to match entire sequences as strings and return the
lines that match them, but sequences in multifasta files are broken on
multiple lines. Also, their identifiers are on a different line, so
matching the sequence doesn't get you the name of the sequence that
matches. It is a simple matter to fix these problems, though. There
are many programs to match sequences with regular expressions. One is
called subfasta. You will practice using regular expressions to
solve routine bioinformatic tasks on the E. coli genome, and
learn a little more about the advantages and disadvantages of motifs.
- Download
subfasta
and protein
sequences from the E. coli genome.
- Also Download the annotated protein
descriptions from the E. coli
genome.
- Look at the genome proteins fasta file.
% less NC_000913.ffp
press 'q' to quit.
- Pull out all the sequence description lines into less.
% grep \> NC_000913.ffp | less
OBS: very important that you quote '\
' the
angle bracket '>' from the shell!
with the backslash '\
'. Otherwise bad things
can happen!
e.g. on't do this: grep >
NC_000913.ffp or you'll erase the file and have to download it again!
- Find all E. coli proteins that match the P-loop
motif from the lecture:
[GA]x(4)GK[ST]
using
subfasta. Try to do this from the help:
% subfasta -h
Hint: '.'
matches any character and the {N}
quantifier matches any number of times.
First print to screen a set of sequences, then save to file ploop.fas.
- Count the matches. First take a look with
% grep \> ploop.fas | less
Type 'q' to quit out of less, space to page down. Then count the number of matches with;
% grep \> ploop.fas | wc
which number is it? if you don't know, try:
% man wc
- Extract the unique IDs. Now we are going to use the unique
ids from the ploop.fas to get more information about the
proteins. The unique IDs are the strings like b0001 and are
only unique for the E. coli genome. We'll cut this part out
and reshape it to get all the matching lines in the protein
description file. From this we'll try to ask how many ATP/GTP
proteins the match missed (false negatives), and also how many
non-ATP/GTP proteins the motif matched (false positives).
First
look at that file with
% less NC_00913.ptt
Do you see the unique IDs with 'b' there? Notice also that each
description is on one line, one line per protein. This kind of
tab-delimited format, one entry per line in simple text is perfect
for quick analysis with UNIX tools. It's critical for what's ahead
in this exercise. Press 'q' to quit.
Now try to cut out the ids from the matching
sequence file in ploop.fas:
% grep \> ploop.fas | cut -f2 -d' ' | less
type 'q' to quit out of less.
Now you'll paste all those ids together with a little perl trick
that you don't have to understand. With this you will make a
regular expression to use with grep on the ptt file
% grep \> ploop.fas | cut -f2 -d' ' | perl -ane 's/\n/|/; print if (/b/);'
Hint: just cut and paste the above line from your web-page
to the terminal.
You should get back a very long unbroken line of identifiers
starting with 'b' and separated by pipes '|'.
- Get descriptions of all proteins that matched the motif.
Cut and paste the output of the last command (all the b00xx's
separated by |s, and leave off the last |), call it xxx, and
make it an argument to
egrep:
% egrep 'xxx' NC_000913.ptt | less
Make sure you use the quotes '!
- Count the matches and
- Save to the file ploop.ptt. Use the tools you used above and the
up-arrow key to bring the last command back.
- Find and count all descriptions in NC_000913.ptt that match the
``ATP'' or ``GTP.''
- Save those to file AGTP.ptt.
- Count how many descriptions are in the combined files.
% cat ploop.ptt AGTP.ptt | wc
Now we'll see how many proteins have ATP or GTP in the description
that didn't match the motif. We'll do this by sorting the
concatenation of the two files so that duplicate lines are next to
each other. Try it, you 'll see what I mean
% cat ploop.ptt AGTP.ptt | sort | less
Did you see duplicates? Now you can remove duplicates and count
them with:
% cat ploop.ptt AGTP.ptt | sort | uniq | wc
The excess over the number in motif.ptt is the number of
proteins that had ATP or GTP in the description
but didn't match the motif. What is this number?
To find proteins that matched the motif but don't contain ATP or
GTP in the description, use the negation option -v to grep
on motif.ptt. What is this number?
Next: Using HMMer to find
Up: Regular Expressions, Profiles and
Previous: UNIX Regular Expressions
David Ardell
2005-01-31