The results of this automated query are displayed as a graphical summary. In addition, multiple options are provided to view the alignment of the S. cerevisiae protein sequence with sequences from a specific taxonomy group, from a specific species, or from the top ten best hits in the dataset.
The data in this resource are updated quarterly with the most recent dataset available at that time. The date of the last update and the release version used for the analysis are posted at the top of the page. The alignments, however, are generated as requested.
UniRef90 dataset
The UniRef90 dataset groups protein sequences of 11 or more residues that have at least 90% sequence identity to each other into clusters. Each cluster is identified by a single representative sequence. The UniProt accession number of this representative sequence is hyperlinked to its UniProt record. To see the entire cluster, click on the "UniRef 90% identity" hyperlink found on the UniProt page for that sequence.
With every release of the UniRef90 dataset, the members of a cluster as well as the representative sequence for that cluster may change. Please refer to the version of the UniRef90 used in the dataset to identify the correct UniRef90 cluster used in the PSI-BLAST results.
PSI-BLAST
PSI-BLAST is an iterative search that is very good at identifying broadly related protein families. For the PSI-BLAST Results, the default parameters of PSI-BLAST are used for 5 iterations. As sequence hits accumulate in each iteration, the query is reconstructed using all sequences identified. Therefore, with each iteration, the results may become less similar to the original query sequence.
Because very large numbers of sequences are available for certain proteins, such as actin and ribosomal proteins, a maximum of 250 sequences is reported for any given ORF. As a result, some results may seem counter-intuitive; for example, in a large protein family where fungal sequences diverge from sequences in other organisms, the fungal sequences may no longer be the best hits even though the search was started with a fungal sequence. For more information, please see the PSI-BLAST paper.
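For readers who want to run a comparable search locally, the settings described above might map onto the NCBI BLAST+ `psiblast` program roughly as sketched here. This is an illustration under assumptions, not a description of the pipeline behind this page: the file name `orf.fasta` and the database name `uniref90` are hypothetical, and using `-max_target_seqs` to express the 250-sequence cap is a guess. The code only builds the command and does not execute it.

```python
import shlex


def build_psiblast_command(query_fasta, database, iterations=5, max_hits=250):
    """Return an argv list for a BLAST+ psiblast run (nothing is executed)."""
    return [
        "psiblast",
        "-query", query_fasta,              # hypothetical path to the ORF protein sequence
        "-db", database,                    # hypothetical local UniRef90 BLAST database
        "-num_iterations", str(iterations), # 5 iterations, as described in the text
        "-max_target_seqs", str(max_hits),  # assumed way to cap the number of hits
        "-outfmt", "6",                     # tabular output
    ]


cmd = build_psiblast_command("orf.fasta", "uniref90")
print(shlex.join(cmd))
```

All other options are left at their defaults, in line with the description above.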
PSI-BLAST vs. BLASTP
PSI-BLAST answers the question "What is the set of sequences most similar to a query set of sequences?" BLASTP answers the question "What are the most similar sequences to a particular query sequence?" Therefore the results will be different.
Although the PSI-BLAST Results page presents the results of a PSI-BLAST analysis, there are several ways to obtain the results of a BLASTP analysis. There is a "BLASTP at NCBI" option located in the Sequence Analysis Tools pull-down menu on the Locus Page. This option will automatically paste the protein sequence of the ORF into the BLASTP query box at NCBI. BLASTP and other BLAST options are also available at NCBI.
If the results contain hits in the UniRef90 dataset (excluding the query protein itself), an overview of the results is presented as a graphical summary at the top of the page.
Dendrogram
A dendrogram is generated by ClustalW with all representative sequences that are hits in the UniRef90 dataset. A thumbnail of the dendrogram illustrating the similarity of the sequences is displayed on the left. A larger version can be viewed by clicking on the thumbnail. Each sequence is identified by its UniProt accession number and the species name. In addition, the sequence name is color-coded based on its taxonomic distribution. The legend for the color-coding is located at the top and/or bottom of the figure.
Taxonomy distribution
The taxonomic distribution of the hits is displayed as a bar graph on the right. The graph is in log scale. The color-coding of the bars corresponds to the color-coding of the sequence name in the dendrogram. Clicking on any of the bars will generate an alignment page of the S. cerevisiae protein sequence with all sequences from the species in that taxonomic grouping.
The groups were determined by SGD curators based on the NCBI taxonomy database.
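To see why a log scale helps here, the following sketch counts hits per taxonomic group and prints the value each bar would have on a log10 axis. The group names and counts are invented for illustration and do not come from any real result.

```python
import math
from collections import Counter

# Hypothetical taxonomic group assigned to each hit (illustrative values only).
hit_groups = ["Fungi"] * 120 + ["Metazoa"] * 40 + ["Bacteria"] * 3 + ["Archaea"]

counts = Counter(hit_groups)
for group, n in counts.most_common():
    # On a log10 axis, a bar for n hits has height log10(n).
    print(f"{group:10s} hits={n:4d} log10={math.log10(n):.2f}")
```

On a linear axis the group with 3 hits would be nearly invisible next to the group with 120; on a log axis the two differ by less than two units. Note that a group with a single hit sits at 0 on a log10 axis.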
Sequence information
The table contains the following information for each sequence:
- UniProt accession number
- organism name
- description of the sequence from the UniProt record
- the reported E-value from the PSI-BLAST query
- the percentage of the S. cerevisiae protein that aligned with the target sequence
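As a rough illustration of the last column, the percentage of the query that aligned could be computed from the alignment coordinates on the query. This sketch assumes 1-based, inclusive coordinates and a single contiguous aligned region; the exact definition used for the table is not stated here, and the function name is hypothetical.

```python
def percent_query_aligned(query_start, query_end, query_length):
    """Percentage of the query covered by an alignment (1-based, inclusive)."""
    if query_length <= 0:
        raise ValueError("query_length must be positive")
    aligned = query_end - query_start + 1
    return 100.0 * aligned / query_length


# An alignment spanning residues 11 to 310 of a 375-residue query covers 300 residues.
print(percent_query_aligned(11, 310, 375))  # 80.0
```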
To view ClustalW alignments, select the sequence(s) of interest by clicking on the checkbox and select an alignment preference.
Viewing sequences
Once the sequences have been chosen, an option to view the sequences must be selected. The Pairwise with S. cerevisiae option will align each selected sequence to the S. cerevisiae protein. The Align all sequences option will align all selected sequences together. The Download in FASTA format option will download a file containing all selected sequences in FASTA format.
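For reference, a FASTA file is plain text in which each sequence has a header line starting with ">" followed by the sequence itself. A minimal sketch of writing the selected sequences in that form is shown below; the identifier and sequence are made up, and the 60-character line width is an assumption rather than a documented property of the download.

```python
def to_fasta(records, width=60):
    """Format (header, sequence) pairs as FASTA text with wrapped sequence lines."""
    lines = []
    for header, seq in records:
        lines.append(f">{header}")
        # Wrap the sequence so that no line exceeds `width` characters.
        for i in range(0, len(seq), width):
            lines.append(seq[i:i + width])
    return "\n".join(lines) + "\n"


# Hypothetical record used only to show the layout.
print(to_fasta([("EXAMPLE1 hypothetical protein", "MKTAYIAKQR" * 7)]), end="")
```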
Step 1: Select species
The box on the left side lists all species that have sequences with significant similarity to the query protein. The number of similar sequences identified in a given species is indicated in parentheses following the species name.
To include a species in the alignment, click on the species name in the box on the left to highlight it, and then click the "Add" button to move the selected species to the box on the right. All the species can be included in the alignment by clicking on the "Add All" button.
The species list in either box can be sorted alphabetically, ascending and descending, as well as by the number of hits, ascending and descending.
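The four sort orders can be illustrated with a small sketch. The hit counts are hypothetical, and breaking ties by species name is an assumption added to make the order deterministic.

```python
# Hypothetical number of hits per species (illustrative values only).
species_hits = {"Candida albicans": 3, "Homo sapiens": 5, "Ashbya gossypii": 1}

by_name_asc = sorted(species_hits)
by_name_desc = sorted(species_hits, reverse=True)
# Sort by hit count; ties are broken alphabetically by species name.
by_hits_asc = sorted(species_hits, key=lambda s: (species_hits[s], s))
by_hits_desc = sorted(species_hits, key=lambda s: (-species_hits[s], s))

for name in by_hits_desc:
    print(f"{name} ({species_hits[name]})")
```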
Step 2: Select parameters
The set of sequences included in the alignment can be narrowed by entering an upper limit for the E-value or by entering a minimum percentage of the S. cerevisiae protein that is aligned.
The E-value must be entered using scientific notation, such as 5e-15.
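The effect of the two parameters can be sketched as a simple filter. The hit records, field names, and function name below are hypothetical, and treating both limits as inclusive is an assumption. Note that a value such as 5e-15 in scientific notation parses directly as a floating-point number.

```python
def filter_hits(hits, max_evalue=None, min_percent_aligned=None):
    """Keep hits at or below the E-value limit and at or above the coverage limit."""
    kept = []
    for hit in hits:
        if max_evalue is not None and hit["evalue"] > max_evalue:
            continue
        if min_percent_aligned is not None and hit["percent_aligned"] < min_percent_aligned:
            continue
        kept.append(hit)
    return kept


# Hypothetical hits (illustrative values only).
hits = [
    {"id": "hit_A", "evalue": 2e-40, "percent_aligned": 95.0},
    {"id": "hit_B", "evalue": 1e-10, "percent_aligned": 88.0},
    {"id": "hit_C", "evalue": 3e-20, "percent_aligned": 40.0},
]

limit = float("5e-15")  # scientific notation, as entered in the form
selected = filter_hits(hits, max_evalue=limit, min_percent_aligned=50.0)
print([h["id"] for h in selected])
```

Here hit_B is dropped because its E-value exceeds the limit, and hit_C is dropped because too little of the query is aligned.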
Step 3: Viewing sequences
After selecting the species and entering search parameters, an option to view the sequences must be selected. The Pairwise with S. cerevisiae option will align each selected sequence to the S. cerevisiae protein. The Align all sequences option will align all selected sequences together. The Download in FASTA format option will download a file containing all selected sequences in FASTA format.
Dendrogram
A dendrogram is generated by ClustalW using the selected sequences only if the Align All Sequences alignment preference was chosen. If the Pairwise with S. cerevisiae alignment preference was chosen, no dendrogram will be generated.
A thumbnail of the dendrogram illustrating the similarity of the sequences is displayed on the left. A larger version can be viewed by clicking on the thumbnail. Each sequence is identified by its UniProt accession number and the species name. In addition, each sequence name is color-coded based on its taxonomic distribution. The legend for the color-coding is located at the top and/or bottom of the figure.
Organism/sequence list
At the top of the page, the species name(s) or sequence name(s), along with any search parameters that were included in the alignment, will be listed. There is also an option to download the sequences used in the alignment in FASTA format.
If the Pairwise with S. cerevisiae alignment preference was chosen from the "Align by best hits" option, the sequence name(s) will be hyperlinked to the pairwise alignment with that sequence. If the Pairwise with S. cerevisiae alignment preference was chosen from the "Align by species" option, the species name(s) will be hyperlinked to the pairwise alignments with sequences from that species.
ClustalW alignment
Sequences are aligned by ClustalW. Each row is labeled on the left side of the page with the UniProt accession number and species name for the target sequence or the S. cerevisiae gene name for the query sequence. The sequence name is hyperlinked to the UniProt record or to the SGD locus page if it is an S. cerevisiae gene.
Aligned sequences are displayed in rows of 60 residues or gaps. The residues are numbered independently for each sequence, on either side of each line. Note that although gaps (indicated with a dash) take up space on the line, they do not contribute to the residue count.
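The numbering rule can be made concrete with a small sketch that splits a gapped alignment into fixed-width rows and tracks a separate residue counter per sequence, skipping gaps. The function name, the exact column layout, and the short example sequences are assumptions for illustration; a width of 5 is used in the example so the output stays short, whereas the page uses 60.

```python
def format_alignment(aligned, width=60):
    """Return display lines for a dict of name -> gapped sequence (equal lengths)."""
    names = list(aligned)
    length = len(next(iter(aligned.values())))
    counts = {name: 0 for name in names}  # residues shown so far, per sequence
    label_w = max(len(name) for name in names)
    lines = []
    for start in range(0, length, width):
        for name in names:
            chunk = aligned[name][start:start + width]
            residues = len(chunk) - chunk.count("-")  # gaps are not counted
            first = counts[name] + 1 if residues else counts[name]
            counts[name] += residues
            lines.append(f"{name:<{label_w}} {first:>5} {chunk} {counts[name]}")
        lines.append("")  # blank line between blocks
    return lines


# Hypothetical two-sequence alignment.
example = {"query": "ACD-EFGHIK", "hit": "AC--EFG-IK"}
print("\n".join(format_alignment(example, width=5)))
```

In the second block the query row starts at residue 5 while the hit row starts at residue 4, because the hit had one more gap in the first block.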
The symbols in the last row of the aligned sequences indicate the degree of sequence similarity. Note that all available species must be present for the similarity indicators to appear; this means that if one of the aligned sequences is missing at that position, that position will not be labeled with a symbol. See the Symbol Key section for an explanation of the similarity labeling.
Symbol key
The alignments are labeled with symbols (bottom row of alignment) indicating the degree of sequence similarity. The "strong" and "weak" similarity groups are determined using the Gonnet PAM250 matrix (strong similarity = score > 0.5; weak similarity = a positive score ≤ 0.5).
Similarity symbols and conserved amino acid groups:
- * (identical): exact matches only
- : (strong similarity): the conserved position contains amino acids from one of the "strong" groups: STA, NEQK, NHQK, NDEQ, QHRK, MILV, MILF, HY, FYW
- . (weak similarity): the conserved position contains amino acids from one of the "weak" groups: CSA, ATV, SAG, STNK, STPA, SGND, SNDEQK, NDEQHK, NEQHRK, FVLIM, HFY
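The labeling rule can be expressed as a short sketch that assigns a symbol to each alignment column using the groups listed above. The function names are hypothetical, and this is an illustration of the rule as described on this page rather than ClustalW's own code. A column containing a gap gets no symbol, matching the note that every sequence must be present at a position for it to be labeled.

```python
STRONG = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
WEAK = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND", "SNDEQK",
        "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def column_symbol(column):
    """Return '*', ':', '.', or ' ' for one alignment column."""
    residues = {c.upper() for c in column}
    if "-" in residues:
        return " "  # a sequence is missing at this position
    if len(residues) == 1:
        return "*"  # identical in every sequence
    if any(residues <= set(group) for group in STRONG):
        return ":"  # all residues fall within one strong group
    if any(residues <= set(group) for group in WEAK):
        return "."  # all residues fall within one weak group
    return " "


def consensus_line(sequences):
    """Build the symbol row for equal-length gapped sequences."""
    return "".join(column_symbol(col) for col in zip(*sequences))


# Hypothetical aligned sequences.
print(repr(consensus_line(["MSCW-L", "MTSDAL"])))
```

In this example the columns are M/M (identical), S/T (strong group STA), C/S (weak group CSA), W/D (no shared group), a gap column, and L/L (identical).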