SGD

SGD Help: PSI-BLAST Results


Description

The PSI-BLAST Results uses the protein sequence of every ORF in S. cerevisiae as the query sequence for PSI-BLAST analysis against UniProt's UniRef90 protein dataset. Prior to January 2005, this resource displayed the results using NCBI's non-redundant (nr) protein dataset.

The results of this automated query are displayed as a graphical summary. In addition, multiple options are provided to view the alignment of the S. cerevisiae protein sequence with sequences from a specific taxonomy group, from a specific species, or from the top ten best hits in the dataset.

The data in this resource are updated quarterly with the most recent dataset available at that time. The date of the last update and the release version used for the analysis are posted at the top of the page. The alignments, however, are generated on request.

UniRef90 dataset

The UniRef90 dataset groups protein sequences (with 11 or more residues) that have at least 90% sequence identity to each other into a single cluster. Each cluster is then identified by a single representative sequence.

Each hit shown in the results is the representative sequence of a UniRef90 cluster, and its UniProt accession number is hyperlinked to its UniProt record. To see the entire cluster, click on the "UniRef 90% identity" hyperlink found on the UniProt page for that sequence.

With every release of the UniRef90 dataset, the members of a cluster as well as the representative sequence for that cluster may change. Please refer to the UniRef90 release version posted with the results to identify the correct UniRef90 cluster used in the PSI-BLAST analysis.

PSI-BLAST

PSI-BLAST is an iterative search that is very good at identifying broadly related protein families. For the PSI-BLAST Results, the default parameters of PSI-BLAST are used for 5 iterations. As sequence hits accumulate in each iteration, the query is reconstructed using all sequences identified. Therefore, with each iteration, the hits may become less similar to the original query sequence.

Because a very large number of sequences is available for certain proteins, such as actin and ribosomal proteins, a maximum of 250 sequences is reported for any given ORF. As a result, some results may seem counter-intuitive. For example, in a large protein family where fungal sequences diverge from sequences in other organisms, the fungal sequences may no longer be the best hits, even though the search started with a fungal query sequence. For more information, please see the PSI-BLAST paper.

PSI-BLAST vs. BLASTP

PSI-BLAST answers the question "What is the set of sequences that is most similar to a query set of sequences?" BLASTP answers the question "What are the most similar sequences to a particular query sequence?" Therefore the results of the two searches will differ.

Although the PSI-BLAST Results presents the results of a PSI-BLAST analysis, there are several ways to obtain the results of a BLASTP analysis. There is a "BLASTP at NCBI" option located in the Sequence Analysis Tools pull-down menu on the Locus Page. This option automatically pastes the protein sequence of the ORF into the BLASTP query box at NCBI. BLASTP, along with other BLAST options, is also available directly at NCBI.

Graphical summary

If the results contain hits in the UniRef90 dataset (excluding the query protein itself), an overview of the results is presented as a graphical summary at the top of the page.

Dendrogram

A dendrogram is generated by ClustalW with all representative sequences that are hits in the UniRef90 dataset. A thumbnail of the dendrogram illustrating the similarity of the sequences is displayed on the left. A larger version can be viewed by clicking on the thumbnail. Each sequence is identified by its UniProt accession number and the species name. In addition, the sequence name is color-coded based on its taxonomic distribution. The legend for the color-coding is located at the top and/or bottom of the figure.

Taxonomy distribution

The taxonomic distribution of the hits is displayed as a bar graph on the right. The graph is in log scale. The color-coding of the bars corresponds to the color-coding of the sequence name in the dendrogram. Clicking on any of the bars will generate an alignment page of the S. cerevisiae protein sequence with all sequences from the species in that taxonomic grouping.

The groups were determined by SGD curators based on the NCBI taxonomy database.

Align by best hits

The "Align by best hits" section lists the ten most significant hits, based on E-value, to the S. cerevisiae protein sequence.
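As an illustration of ranking by E-value, the following minimal Python sketch selects the most significant hits from a list. The `Hit` type and its field names are hypothetical and are not taken from SGD's implementation; the only assumption carried over from the text is that a smaller E-value means a more significant hit.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """Hypothetical record for one PSI-BLAST hit (field names are illustrative)."""
    accession: str
    evalue: float


def best_hits(hits: List[Hit], n: int = 10) -> List[Hit]:
    """Return the n most significant hits, smallest E-value first."""
    return sorted(hits, key=lambda hit: hit.evalue)[:n]
```

Because Python's sort is stable, hits with equal E-values keep their original relative order.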

Sequence information

The table contains the following information for each sequence:

To view ClustalW alignments, select the sequence(s) of interest by clicking on the checkbox and select an alignment preference.

Viewing sequences

Once the sequences have been chosen, an option to view the sequences must be selected. The Pairwise with S. cerevisiae option will align each selected sequence to the S. cerevisiae protein. The Align all sequences option will align all selected sequences together. The Download in FASTA format option will download a file containing all selected sequences in FASTA format.
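For readers who want to post-process a downloaded file, the sketch below shows one way a set of sequences could be written in FASTA format. It is only an illustration: the header layout (accession followed by species name) and the 60-character line width are assumptions, not a description of the file SGD produces.

```python
from typing import Iterable, Tuple


def to_fasta(records: Iterable[Tuple[str, str]], width: int = 60) -> str:
    """Format (header, sequence) pairs as FASTA text.

    Each record becomes a '>' header line followed by the sequence
    wrapped at `width` characters per line.
    """
    lines = []
    for header, sequence in records:
        lines.append(f">{header}")
        for start in range(0, len(sequence), width):
            lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n" if lines else ""
```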

Align by species

The "Align by species" section lists the species and the number of sequences from that species that have sequence similarity to the S. cerevisiae protein. Specific species can be selected in order to align sequences from that species to the S. cerevisiae protein.

Step 1: Select species

The box on the left side lists all species that contain significant similarity to the query protein. The number of similar sequences identified in a given species is indicated in parentheses following the species name.

To include a species in the alignment, click on the species name in the box on the left to highlight it, and then click the "Add" button to move the selected species to the box on the right. All the species can be included in the alignment by clicking on the "Add All" button.

The species list in either box can be sorted alphabetically, ascending and descending, as well as by the number of hits, ascending and descending.
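The four sort orders can be expressed with a single key function. The sketch below is a hypothetical illustration in Python; the function name, the `by` parameter, and the dictionary layout are assumptions made for the example.

```python
from typing import Dict, List, Tuple


def sort_species(counts: Dict[str, int], by: str = "name",
                 descending: bool = False) -> List[Tuple[str, int]]:
    """Sort (species, hit count) pairs alphabetically or by number of hits."""
    if by == "name":
        key = lambda item: item[0].lower()   # case-insensitive alphabetical order
    elif by == "hits":
        key = lambda item: item[1]           # order by number of similar sequences
    else:
        raise ValueError("by must be 'name' or 'hits'")
    return sorted(counts.items(), key=key, reverse=descending)
```

For example, with made-up counts, `sort_species({"Homo sapiens": 4, "Candida albicans": 7}, by="hits", descending=True)` lists Candida albicans first.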

Step 2: Select parameters

The set of sequences included in the alignment can be narrowed by entering an upper limit for the E-value or by entering a minimum percentage of the S. cerevisiae protein that is aligned.

The E-value must be entered using scientific notation, such as 5e-15.
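The two filters can be sketched in Python as follows. This is an illustration under stated assumptions: the dictionary keys `evalue` and `percent_aligned` are hypothetical, and the percentage is assumed to be precomputed as the fraction of the S. cerevisiae protein covered by the alignment. Note that Python's `float` accepts scientific notation such as `5e-15` directly.

```python
from typing import Dict, List, Optional


def parse_evalue(text: str) -> float:
    """Parse an E-value entered in scientific notation, e.g. '5e-15'."""
    value = float(text)
    if value < 0:
        raise ValueError("E-value cannot be negative")
    return value


def filter_hits(hits: List[Dict], max_evalue: Optional[float] = None,
                min_percent_aligned: Optional[float] = None) -> List[Dict]:
    """Keep hits at or below the E-value limit and at or above the coverage limit."""
    kept = []
    for hit in hits:
        if max_evalue is not None and hit["evalue"] > max_evalue:
            continue  # not significant enough
        if min_percent_aligned is not None and hit["percent_aligned"] < min_percent_aligned:
            continue  # too little of the query protein is aligned
        kept.append(hit)
    return kept
```

Leaving a parameter as `None` corresponds to leaving that field blank, so no filtering is applied for it.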

Step 3: Viewing sequences

After selecting the species and entering search parameters, an option to view the sequences must be selected. The Pairwise with S. cerevisiae option will align each selected sequence to the S. cerevisiae protein. The Align all sequences option will align all selected sequences together. The Download in FASTA format option will download a file containing all selected sequences in FASTA format.

Email notification

If more than 50 sequences in total are selected, it may take up to an hour to generate the alignment. Therefore, an email address will be requested. The URL of the alignment page will be emailed to this address and will remain available for two days after the alignments are generated. The email address will not be used for any other purpose.

Alignment page

The alignment page is generated based on the sequences that were chosen through the taxonomy distribution, align by best hits, or align by species option.

Dendrogram

A dendrogram is generated by ClustalW using the selected sequences only if the Align All Sequences alignment preference was chosen. If the Pairwise with S. cerevisiae alignment preference was chosen, no dendrogram will be generated.

A thumbnail of the dendrogram illustrating the similarity of the sequences is displayed on the left. A larger version can be viewed by clicking on the thumbnail. Each sequence is identified by its UniProt accession number and the species name. In addition, each sequence name is color-coded based on its taxonomic distribution. The legend for the color-coding is located at the top and/or bottom of the figure.

Organism/sequence list

At the top of the page, the species name(s) or sequence name(s), along with any search parameters that were included in the alignment, will be listed. There is also an option to download the sequences used in the alignment in FASTA format.

If the Pairwise with S. cerevisiae alignment preference was chosen from the "Align by best hits" option, the sequence name(s) will be hyperlinked to the pairwise alignment with that sequence. If the Pairwise with S. cerevisiae alignment preference was chosen from the "Align by species" option, the species name(s) will be hyperlinked to the pairwise alignments with sequences from that species.

ClustalW alignment

Sequences are aligned by ClustalW. Each row is labeled on the left side of the page with the UniProt accession number and species name for the target sequence or the S. cerevisiae gene name for the query sequence. The sequence name is hyperlinked to the UniProt record or to the SGD locus page if it is an S. cerevisiae gene.

Aligned sequences are displayed in rows of 60 residues or gaps. The residues are numbered independently for each sequence, on either side of each line. Note that although gaps (indicated with a dash) take up space on the line, they do not contribute to the residue count.
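The numbering rule can be made concrete with a short Python sketch. It splits gapped sequences into rows of 60 columns and tracks, for each sequence independently, the first and last residue number on each row while ignoring gaps. The function name and the tuple layout are hypothetical, and the handling of a row that contains only gaps (the running count is repeated) is an assumption rather than documented behavior.

```python
from typing import Dict, Iterator, Tuple


def alignment_rows(aligned: Dict[str, str],
                   width: int = 60) -> Iterator[Tuple[str, int, str, int]]:
    """Yield (name, first_residue, chunk, last_residue) for each display row.

    `aligned` maps a sequence name to its gapped sequence; all gapped
    sequences must have the same length. Gaps ('-') occupy a column but
    do not advance the residue count.
    """
    lengths = {len(seq) for seq in aligned.values()}
    if len(lengths) > 1:
        raise ValueError("aligned sequences must have equal length")
    total = lengths.pop() if lengths else 0
    counts = {name: 0 for name in aligned}  # residues seen so far, per sequence
    for start in range(0, total, width):
        for name, seq in aligned.items():
            chunk = seq[start:start + width]
            residues = len(chunk) - chunk.count("-")
            # Assumption: a gap-only row repeats the running count.
            first = counts[name] + 1 if residues else counts[name]
            counts[name] += residues
            yield name, first, chunk, counts[name]
```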

The symbols in the last row of the aligned sequences indicate the degree of sequence similarity. Note that every aligned sequence must have a residue at a position for a similarity indicator to appear; if any one of the aligned sequences has a gap at that position, the position will not be labeled with a symbol. See the Symbol Key section for an explanation of the similarity labeling.

Symbol key

The alignments are labeled with symbols (bottom row of alignment) indicating the degree of sequence similarity. The "strong" and "weak" similarity groups are determined using the Gonnet Pam250 matrix (strong similarity = score > 0.5; weak similarity = a positive score <= 0.5).

Symbol   Similarity          Conserved amino acid groups

*        identical           Exact matches only.

:        strong similarity   The conserved position contains amino acids from one of the "strong" groups listed below (each row is a group):

                 STA
                 NEQK
                 NHQK
                 NDEQ
                 QHRK
                 MILV
                 MILF
                 HY
                 FYW

.        weak similarity     The conserved position contains amino acids from one of the "weak" groups listed below (each row is a group):

                 CSA
                 ATV
                 SAG
                 STNK
                 STPA
                 SGND
                 SNDEQK
                 NDEQHK
                 NEQHRK
                 FVLIM
                 HFY
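The symbol key can be applied programmatically. The following Python sketch assigns a symbol to each alignment column using the strong and weak groups listed in the key; it is an illustration of the rule as described on this page, not ClustalW's own code, and the function names are hypothetical. A column containing a gap is left unlabeled.

```python
from typing import List

STRONG_GROUPS = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
WEAK_GROUPS = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
               "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def column_symbol(column: str) -> str:
    """Return '*', ':', '.', or ' ' for one alignment column."""
    residues = set(column.upper())
    if "-" in residues:
        return " "   # a sequence is missing at this position: no symbol
    if len(residues) == 1:
        return "*"   # identical in every sequence
    if any(residues <= set(group) for group in STRONG_GROUPS):
        return ":"   # all residues fall within one "strong" group
    if any(residues <= set(group) for group in WEAK_GROUPS):
        return "."   # all residues fall within one "weak" group
    return " "


def consensus_line(aligned: List[str]) -> str:
    """Build the symbol row for a list of equal-length gapped sequences."""
    return "".join(column_symbol("".join(col)) for col in zip(*aligned))
```

For example, the columns M/M/M, S/T/S, C/S/A, and -/K/K are labeled `*`, `:`, `.`, and a blank, respectively.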
