SGD Help: Pattern Matching
PatMatch permits the identification of patterns or motifs within
the collection of all S. cerevisiae protein or DNA
sequences. The pattern can be either a simple string or a regular
expression. Standard substitutions are allowed in the string, such as
using "R" for any purine base when performing a nucleotide
search. Pattern matching offers an alternative to sequence alignment
techniques such as BLAST and FASTA for identifying nucleotide or
peptide sequences with conserved or biologically interesting regions.
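As an illustration of how a substitution such as "R" for any purine can work, the following minimal sketch expands standard IUPAC nucleotide codes into a regular expression and scans a sequence with it. It is not PatMatch's own implementation, and the function names iupac_to_regex and find_hits are hypothetical; the syntax PatMatch actually accepts is the one described on the Pattern Matching page.

```python
import re

# Standard IUPAC nucleotide ambiguity codes mapped to regex character classes.
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def iupac_to_regex(pattern):
    """Translate a simple IUPAC nucleotide string into a compiled regex.

    Hypothetical helper for illustration only; not part of PatMatch.
    """
    try:
        body = "".join(IUPAC_DNA[base] for base in pattern.upper())
    except KeyError as err:
        raise ValueError(f"unsupported symbol: {err.args[0]!r}") from None
    return re.compile(body)


def find_hits(pattern, sequence):
    """Return (1-based start, matched text) for each non-overlapping hit."""
    regex = iupac_to_regex(pattern)
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence.upper())]
```

For example, find_hits("RTG", "ccATGaaGTGttCTG") reports ATG and GTG but not CTG, because "R" stands only for A or G.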
SGD offers a selection of sequence datasets that can be
searched, depending on the user's requirements.
- GenBank
is the subset of DNA sequences submitted to GenBank that have been
derived from S. cerevisiae DNA. It includes results of the
systematic sequencing as well as results from individual laboratories.
- genoSc
is the complete, up-to-date Saccharomyces cerevisiae genome
sequence maintained at SGD.
- ORF-Coding consists
of ORF sequences from the initial ATG to the stop codon but without
upstream or downstream sequences, intron sequences, or bases not
translated due to translational frameshifting. Contains all ORFs
except those classified as Dubious and pseudogenes.
- ORF-Genomic consists of ORF sequences from the initial ATG to the
stop codon including intron sequences and any bases not translated due
to translational frameshifting, but not including upstream or
downstream sequences. Contains all ORFs except those classified as Dubious and pseudogenes.
- ORF-Genomic-1000 consists of ORF sequences from the initial ATG to the
stop codon including intron sequences and any bases not translated due
to translational frameshifting, plus 1000 bp upstream and downstream
of each ORF. Contains all ORFs except those
classified as Dubious and pseudogenes.
- ORF-Trans is a dataset
containing protein translations of all systematically named ORFs
except those classified as Dubious and pseudogenes.
- NRSC is a non-redundant set
of S. cerevisiae protein sequences from GenBank. For
example, while there may be 10 GenBank entries
for a particular sequence, it will be represented only once in the NRSC.
- NotFeature includes those portions of the systematic sequence that
are not an ORF, ARS, centromere, rRNA gene, tRNA gene, snRNA gene,
snoRNA gene, LTR, telomeric element or Ty element.
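Because ORF-Genomic-1000 sequences carry 1000 bp of flanking sequence on each side, a hit position in that dataset is not measured from the ATG. The sketch below converts such a position into an offset from the initial ATG. It assumes that reported positions are 1-based and that the full 1000 bp upstream flank is present; neither is stated above, and the function names offset_from_atg and region are hypothetical.

```python
FLANK = 1000  # bp added on each side in ORF-Genomic-1000, per the dataset description


def offset_from_atg(hit_position, flank=FLANK):
    """Convert a hit position in a flanked sequence to an offset from the ATG.

    Assumes hit_position is 1-based (an assumption, not confirmed by the
    help text). Offset 0 is the A of the initial ATG; negative offsets
    lie upstream of it.
    """
    if hit_position < 1:
        raise ValueError("hit_position must be 1 or greater")
    return hit_position - (flank + 1)


def region(hit_position, orf_length, flank=FLANK):
    """Classify a hit start as 'upstream', 'orf', or 'downstream'.

    orf_length is the ATG-to-stop length including introns, as in ORF-Genomic.
    """
    offset = offset_from_atg(hit_position, flank)
    if offset < 0:
        return "upstream"
    if offset < orf_length:
        return "orf"
    return "downstream"
```

Under these assumptions, position 1001 corresponds to the A of the ATG, and position 1 lies 1000 bp upstream of it.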
Tips for Pattern Matching:
- The pattern may be lowercase or uppercase. There is no maximum or
minimum pattern size.
- A description of the allowed syntax of the pattern is provided at
the bottom of the Pattern Matching page.
- The Strand option
is used for restricting NUCLEOTIDE searches to only
one strand of the specified dataset. The default is that both strands are
searched. If the "Strand in dataset" option is chosen, then only the
strand that is actually present in the dataset will be searched. In
other words, if the chosen dataset is:
- GenBank -- only the sequences as entered in GenBank will be searched
- genoSc (complete S. cerevisiae genome) -- only the Watson strands of
the chromosomes will be searched
- ORF-Coding, ORF-Genomic, and ORF-Genomic-1000 -- only the Watson strand
will be searched for ORFs on the Watson strand (i.e. those whose names end
in "W"), and only the Crick strand will be searched for ORFs on the Crick
strand (i.e. those whose names end in "C")
Choosing "Reverse complement of strand in dataset" restricts the PatMatch
search to the reverse complement of the strands described above.
Please note that in the displayed sequence, only the Watson strand will be
shown, regardless of which strand option is chosen. If your pattern has a
match on the Crick strand, the reverse complement of the pattern will be
highlighted in the Watson sequence.
- The Mismatch, Deletion or Insertion options permit matches to
sequences that contain a defined number of substitutions, deletions or
insertions relative to the input pattern. This number can range from 1
to 3. At this time, these options cannot be used with patterns that
contain regular expressions.
- When searching for patterns near the beginning or end of a sequence,
bear in mind that nucleotide sequences include the start codon (5' ATG)
and the stop codon (TAA, TAG, or TGA). Peptide sequences include the
initiator methionine, whether or not it is removed in vivo.
- If the genoSc, ORF-Coding, ORF-Genomic, ORF-Genomic-1000, ORF-Trans or
NotFeature dataset is searched, both the Chromosome Graphic
and Full Search Result Table are displayed. If the
GenBank or NRSC datasets are used, only the Full Search
Result Table is shown. The Chromosome Graphic displays all the hits in
the 16 yeast chromosomes; the user may click on any region in any
chromosome bar to go to the Features Map for viewing the hits. The
Full Search Result Table lists the name of each sequence containing
a match, the hit number, the matching pattern, the matching
position, a link to the DNA or protein sequence, and any available
information about the sequence. Matching position is given relative to
the entire sequence matched (listed in the Sequence Name column); the sequence may be an entire chromosome,
an ORF (DNA or amino acid sequence), or a region of untranslated DNA.
- Sequences with hits are ordered in the table by the number of hits
and by sequence name.
- At this time, PatMatch will not find overlapping hits.
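Two of the tips above, the strand behavior and the absence of overlapping hits, can be illustrated with a short sketch. It is not PatMatch's implementation: it handles plain-string patterns only, the function search_strands and its strand argument values are hypothetical, and 1-based positions are an assumption. A match on the Crick strand is found by searching the Watson strand for the reverse complement of the pattern, which is also how such a hit would appear when highlighted in the Watson sequence.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA string."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def search_strands(pattern, watson, strand="both"):
    """Search a plain-string pattern and report hits in Watson coordinates.

    strand is "both", "watson", or "crick" (hypothetical option names).
    Returns sorted (1-based start on Watson, strand label, Watson text) tuples.
    """
    watson = watson.upper()
    pattern = pattern.upper()
    queries = []
    if strand in ("both", "watson"):
        queries.append(("W", pattern))
    if strand in ("both", "crick"):
        # A Crick-strand match appears on Watson as the reverse
        # complement of the pattern.
        queries.append(("C", reverse_complement(pattern)))
    hits = []
    for label, query in queries:
        # re.finditer scans left to right and yields non-overlapping matches.
        for m in re.finditer(re.escape(query), watson):
            hits.append((m.start() + 1, label, m.group()))
    return sorted(hits)
```

With this sketch, searching "AAAAA" for "AA" on the Watson strand yields two hits (positions 1 and 3) rather than four, since each match resumes scanning after the previous one ends.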
If a PatMatch search returns few or no matches, the user can try to
increase the number of matches in several ways. On returning to the
PatMatch search page, the user can change the dataset searched (for example,
from genoSc to GenBank), use a less selective pattern, or increase the
number of allowed mismatches, deletions or insertions.
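To show why allowing mismatches increases the number of matches, here is a minimal sketch that counts substitutions only (insertions and deletions are not modeled). It is an illustration under stated assumptions rather than PatMatch's algorithm: the function name find_with_mismatches is hypothetical, positions are assumed to be 1-based, and, unlike PatMatch, this simple scan also reports overlapping windows.

```python
def find_with_mismatches(pattern, sequence, max_mismatches=1):
    """Report windows that differ from pattern by at most max_mismatches.

    Substitutions only. Returns (1-based start, matched text, mismatch count)
    tuples. Every window is tested, so overlapping hits are included.
    """
    pattern = pattern.upper()
    sequence = sequence.upper()
    width = len(pattern)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Count positions where the window and the pattern disagree.
        mismatches = sum(1 for p, s in zip(pattern, window) if p != s)
        if mismatches <= max_mismatches:
            hits.append((start + 1, window, mismatches))
    return hits
```

For instance, "GATTACA" has no exact match in "TTGATGACATT", but with one mismatch allowed the window GATGACA starting at position 3 is reported.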
Aborting a PatMatch Search
To abort a search, the user should click on the button labeled "Click
here to abort the search", which will actually stop the process
running on the SGD server. This is better than hitting the "Back"
button on the browser, since otherwise the SGD computer will continue
to process the search request.
PatMatch can be accessed:
- by selecting the "PatMatch" hypertext link on the tool bar at the top of most SGD
WWW pages.
- by selecting the "Pattern Matching" link on the Analysis & Tools contents page.
- Links within SGD
  - BLAST Search Page
  - FASTA Search Page
- External links
  - GenBank
Go to PatMatch