SGD Help: Protein Information
The Protein page provides detailed information about the protein encoded by a particular gene. This page contains locus specific nomenclature and protein product information; a brief description of the role of the protein within the cell; the predicted primary protein sequence; detailed domain/motif information; and basic information derived from the protein sequence, including physico-chemical properties and other values. Links provide access to detailed prediction-based and manually curated referenced information and to various external resources.
- Contents of the Protein Page
- External Identifiers
The Protein Overview section contains several fields of nomenclature for the protein in question. If a field doesn't have a value, it will not be listed. Potential fields that may be listed under this heading include:
- Aliases: a list of all the alternative names in the literature given to the gene/protein.
- Gene Product: For protein coding genes this field contains the name of the protein.
- Protein Product: The name of the protein. For protein coding genes this field should be the same as the 'Gene Product' field above, or a synonym of it.
- Description: a concise summary of the biological role and molecular function of the protein and/or gene.
- Name Description: contains the expanded form of the standard name, as described in the literature.
- EC Number: the Enzyme Commission number, for known enzymes. The value is a link to a dedicated page within SGD for the EC number, with additional information and links.
- Paralog: if the protein has an identified paralog, a link is provided to the Locus Summary page of the paralog in SGD.
This section currently contains the protein abundance, expressed as the number of molecules per cell, calculated using GFP fusion proteins and quantitative western blot analysis by Ghaemmaghami et al. (2003), as displayed on the Yeast GFP Fusion Localization Database website.
The Domains and Classification section displays the results of domain predictions for yeast protein sequences using the InterProScan program (Jones P. et al. (2014)). InterProScan is a tool that combines different protein signature recognition methods into one resource. The InterPro database integrates motif, domain and protein family HMM information from the following member databases: Gene3D, PANTHER, Pfam, Phobius, PIR Superfamily, PRINTS, SignalP, SMART, Superfamily, TIGRFAM and TMHMM. The domain predictions are refreshed every 3 months to keep them up to date. The predictions are shown in both tabular and graphical form.
- Domains Table: The table shows the coordinates of each domain, along with the accession ID, a description of the domain, the source of the domain information and the number of yeast genes that contain that domain. The domain accession ID is linked to a page in SGD that provides additional information about the domain. The contents of the table can be downloaded as a tab-delimited text file.
- Domain Locations Graph: Hovering the mouse above a domain on the graphical representation will show an info box with the description and precise coordinates of the domain within the protein.
- Shared Domains: This section of the page shows a Cytoscape network visualization of the yeast proteins (grey circles) that share domains (colored squares) with the selected protein (yellow circle). The visualization shows the proteins that share the largest number of domains with the central protein and is limited to a maximum of 100 nodes and 250 edges. The nodes of the graph link to Locus Summary pages and domain pages within SGD.
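The ranking behind the Shared Domains view can be sketched as follows. This is a minimal illustration, not SGD's implementation: the function name, the input mapping, and the protein and domain identifiers are all hypothetical, and the cap is applied to proteins only, whereas the page limit of 100 nodes counts every node in the graph.

```python
def rank_by_shared_domains(query, protein_domains, max_proteins=100):
    """Rank other proteins by how many domains they share with `query`.

    `protein_domains` maps a protein name to the set of domain IDs
    predicted for it (a hypothetical input format).
    """
    query_domains = protein_domains[query]
    shared = {}
    for protein, domains in protein_domains.items():
        if protein == query:
            continue
        common = query_domains & domains
        if common:  # keep only proteins with at least one shared domain
            shared[protein] = common
    # Most shared domains first; ties broken by name so the order is stable.
    ranked = sorted(shared.items(), key=lambda item: (-len(item[1]), item[0]))
    return ranked[:max_proteins]


# Placeholder identifiers, not real SGD or InterPro accessions.
example = {
    "QUERY": {"DOM_1", "DOM_2"},
    "PROT_A": {"DOM_1", "DOM_2"},
    "PROT_B": {"DOM_2"},
    "PROT_C": {"DOM_3"},
}
for protein, common in rank_by_shared_domains("QUERY", example):
    print(protein, sorted(common))
```

Each returned pair corresponds to one protein node plus the edges to the domains it shares with the query.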
This section of the page has several subsections that show sequence based information about the protein.
The amino acid sequence is displayed in 60-residue blocks. Residues are numbered on the left side. The sequence shown by default is that of the reference strain, but the pull-down menu allows selection of the sequence from any of the other S. cerevisiae strains whose sequence is available in SGD. Also included is a button to Download the sequence, which loads a flat-text browser page with the amino acid sequence displayed in FASTA format. Known modification sites (currently phosphorylation) are highlighted on the sequence by color. (See next section for further information on modifications.)
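As an illustration of the layout described above, the sketch below splits a sequence into 60-residue blocks with the 1-based position of the first residue in each block, and writes a FASTA record with the same line width. The function names and the header text are assumptions made for this example; the exact header SGD writes is not reproduced here.

```python
def numbered_blocks(sequence, width=60):
    """Return (start_position, block) pairs, with 1-based start positions."""
    return [
        (start + 1, sequence[start:start + width])
        for start in range(0, len(sequence), width)
    ]


def to_fasta(header, sequence, width=60):
    """Format a sequence as a FASTA record with `width` residues per line."""
    lines = [">" + header]
    lines.extend(block for _, block in numbered_blocks(sequence, width))
    return "\n".join(lines) + "\n"


# Toy sequence of 130 residues, not a real yeast protein.
toy_sequence = "M" + "K" * 129
for position, block in numbered_blocks(toy_sequence):
    print(f"{position:>5} {block}")
print(to_fasta("toy_protein example header", toy_sequence), end="")
```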
If protein modification data are available for the protein, this section of the page shows them in a table. Currently, only phosphorylation sites are available, but in the future this section will include additional modifications as well. Protein phosphorylation data are drawn from the PhosphoGRID and Phosphopep databases. The phosphorylated residues are listed, and the entry in the source column links to the corresponding record at the external website. The phosphorylation sites shown are for the strain selected for the sequence display above; changing the selected strain may change the list of phosphorylation sites. Since annotations at PhosphoGRID don't record strain information, they are assumed to be valid for each strain, unless the indicated residue is not present (that is, mutated) in a given strain.
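The strain rule described above can be sketched as a simple filter: a site is kept for a strain only if the annotated residue is found at the annotated position in that strain's sequence. This is an assumption-laden illustration, not SGD's code. In particular, it assumes positions are directly comparable between strains (no insertions or deletions), and the site format and function name are hypothetical.

```python
def sites_valid_for_strain(sites, strain_sequence):
    """Keep sites whose annotated residue is present in the strain sequence.

    `sites` is a list of (residue, position) pairs with 1-based positions.
    """
    valid = []
    for residue, position in sites:
        index = position - 1
        in_range = 0 <= index < len(strain_sequence)
        if in_range and strain_sequence[index] == residue:
            valid.append((residue, position))
    return valid


# Toy data: the second strain carries an S-to-A change at position 2.
annotated_sites = [("S", 2), ("T", 4)]
print(sites_valid_for_strain(annotated_sites, "MSKT"))
print(sites_valid_for_strain(annotated_sites, "MAKT"))
```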
Data in this section are calculated from the protein sequence using BioPerl Seq libraries and CODONW software.
The Amino Acid Composition is based on the primary sequence. The table contains three columns: the first lists the one letter designations for the twenty amino acids, the second column lists the number of amino acids present in one molecule, and the third contains the composition expressed as a percentage.
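A three-column composition table of this kind can be derived from the sequence alone. The sketch below is illustrative only (SGD computes its values with BioPerl, as noted above); the function name is an assumption, and residues outside the twenty standard one-letter codes are ignored.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the twenty standard one-letter codes


def amino_acid_composition(sequence):
    """Return (letter, count, percent) rows for the twenty amino acids."""
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    rows = []
    for aa in AMINO_ACIDS:
        percent = 100.0 * counts[aa] / total if total else 0.0
        rows.append((aa, counts[aa], percent))
    return rows


for letter, count, percent in amino_acid_composition("MKVLAAGW"):
    print(f"{letter}\t{count}\t{percent:.2f}")
```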
This section contains various physico-chemical properties of the protein calculated from the sequence, including:
- Length (a.a.): the predicted full length of the translated gene product.
- Molecular Weight (Da): the predicted molecular weight of the full length protein in daltons (Da).
- Isoelectric Point (pI): the theoretical isoelectric point (pI) is the pH at which the protein carries no net charge.
- Formula: molecular formula of the protein.
- Instability Index
The instability index was developed based on a statistical analysis of 12 unstable and 32 stable proteins (Guruprasad et al., 1990). This analysis revealed the presence of certain dipeptides that occurred with significantly different frequencies between stable and unstable proteins. A dipeptide instability weight value (DIWV) was assigned to each of 400 different dipeptides. These weight values were then used to calculate an instability index (II) defined as:
$$\mathrm{II} = \frac{10}{L} \sum_{i=1}^{L-1} \mathrm{DIWV}(x_i y_{i+1})$$

where $L$ is the length of the sequence, DIWV is the dipeptide instability weight value, and $x_i y_{i+1}$ is the dipeptide starting at position $i$.
Proteins with an instability index less than 40 are predicted to be stable, whereas those with a value greater than 40 are predicted to be unstable.
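The calculation can be sketched as below. The function takes the weight table as an argument because the published table of 400 DIWV values is not reproduced here; the weights in the usage example are placeholder numbers chosen for illustration, not values from Guruprasad et al. The function names are also assumptions.

```python
def instability_index(sequence, diwv):
    """Compute II = (10 / L) * sum of DIWV over all consecutive dipeptides.

    `diwv` maps two-letter dipeptide strings to their instability weights.
    A dipeptide missing from the table raises KeyError.
    """
    length = len(sequence)
    if length < 2:
        raise ValueError("sequence must contain at least one dipeptide")
    total = sum(diwv[sequence[i:i + 2]] for i in range(length - 1))
    return (10.0 / length) * total


def is_predicted_stable(index_value):
    """Apply the threshold of 40 (the text does not define the case II == 40)."""
    return index_value < 40


# Placeholder weights for a toy sequence; NOT the published DIWV values.
toy_weights = {"MK": 1.0, "KV": 2.0, "VL": 5.0}
value = instability_index("MKVL", toy_weights)
print(value, is_predicted_stable(value))
```

In practice the full 400-entry table would be loaded once and passed in unchanged.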
- Aliphatic Index
The aliphatic index refers to the relative volume of a protein that is occupied by aliphatic side chains (alanine, isoleucine, leucine and valine) and contributes to the increased thermostability observed for globular proteins. The aliphatic index of a protein is calculated according to the following formula (Ikai, 1980):
$$\text{Aliphatic index} = X(\mathrm{Ala}) + a \cdot X(\mathrm{Val}) + b \cdot \left( X(\mathrm{Ile}) + X(\mathrm{Leu}) \right)$$

where X(Ala), X(Val), X(Ile), and X(Leu) are the mole percent (100 × mole fraction) of alanine, valine, isoleucine, and leucine. The coefficients a and b are the relative volume of valine side chains (a = 2.9) and of Leu/Ile side chains (b = 3.9) relative to that of alanine side chains.
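The formula translates directly into a few lines of code. This is a sketch for illustration (the function name is an assumption); the coefficients are the values of a and b given above.

```python
def aliphatic_index(sequence, a=2.9, b=3.9):
    """Aliphatic index = X(Ala) + a * X(Val) + b * (X(Ile) + X(Leu))."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence is empty")

    def mole_percent(residue):
        # 100 x mole fraction of the residue in the sequence
        return 100.0 * seq.count(residue) / len(seq)

    return (
        mole_percent("A")
        + a * mole_percent("V")
        + b * (mole_percent("I") + mole_percent("L"))
    )


print(aliphatic_index("MKVLAAGIL"))
```

For example, a sequence made of equal parts Ala, Val, Ile and Leu has 25 mole percent of each, giving 25 + 2.9 × 25 + 3.9 × 50 = 292.5.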
Values for Codon Bias Index (CBI), Codon Adaptation Index (CAI), Frequency of Optimal Codons (Fop), Hydropathicity of Protein (GRAVY score), and Aromaticity Score (AROMO) are calculated based on the specific genetic code and codon usage of a given organism and organelle. These values were calculated using the CodonW software program written by John Peden.
CodonW analyzes the correspondence between amino acids and codon usage in a set of protein sequences, based on a given genetic code (i.e. that used in the S. cerevisiae nucleus versus that used in its mitochondrion). CodonW was designed to work with any genetic code. Decisions regarding whether an amino acid is synonymous or non-synonymous, the translation of a codon, the number of codons in a codon family, how many synonyms a codon has, are all determined at run time. Seven alternatives to the universal genetic code have been built in to the program, including S. cerevisiae chromosomal codon usage and S. cerevisiae mitochondrial codon usage. In SGD, we have used these two built-in options, as appropriate, to perform codon usage-based calculations for chromosomally-encoded or mitochondrially-encoded ORFs. Note that codon usage-based calculations are not currently performed for ORFs present within transposable elements (Ty elements), because the codon usage of transposable element genes differs from that of chromosomal genes (see the CodonW tutorial).
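Two of the values listed above depend only on the amino acid sequence and can be sketched without any codon usage table: the GRAVY score, taken here as the mean Kyte-Doolittle hydropathy of the residues, and the aromaticity score, taken here as the fraction of Phe, Trp and Tyr residues. This is an illustrative sketch of those standard definitions, not CodonW's code, and the function names are assumptions. CBI, CAI and Fop need organism-specific codon data and are not shown.

```python
# Kyte-Doolittle hydropathy values for the twenty standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Mean hydropathy of all residues (grand average of hydropathicity)."""
    seq = sequence.upper()
    return sum(KYTE_DOOLITTLE[residue] for residue in seq) / len(seq)


def aromaticity(sequence):
    """Fraction of residues that are aromatic (Phe, Trp or Tyr)."""
    seq = sequence.upper()
    return sum(seq.count(residue) for residue in "FWY") / len(seq)


print(gravy("MKVLAAGIL"), aromaticity("MKVLAAGILFWY"))
```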
The extinction coefficient (epsilon) is the wavelength-dependent molar absorptivity coefficient with units of M⁻¹ cm⁻¹. The extinction coefficient provides an indication of the amount of light that a given protein will absorb at a certain wavelength (usually 280 nm). During protein purification a spectrophotometer can be used to follow the protein of interest if the extinction coefficient is known. The molar extinction coefficient of a protein can be estimated based on its amino acid composition. The extinction coefficient of the native protein in water can be calculated based on the molar extinction coefficient of tyrosine, tryptophan and cystine (cysteine does not absorb much at wavelengths greater than 260 nm while cystine does) using the following equation:
$$E(\mathrm{Prot}) = \mathrm{Numb}(\mathrm{Tyr}) \cdot \mathrm{Ext}(\mathrm{Tyr}) + \mathrm{Numb}(\mathrm{Trp}) \cdot \mathrm{Ext}(\mathrm{Trp}) + \mathrm{Numb}(\mathrm{Cystine}) \cdot \mathrm{Ext}(\mathrm{Cystine})$$

where Ext(Tyr) = 1490, Ext(Trp) = 5500, and Ext(Cystine) = 125.
The absorbance (optical density) can then be calculated using the following formula:
$$\mathrm{Absorb}(\mathrm{Prot}) = E \times l \times C$$

where E is the extinction coefficient, l is the pathlength (cm), and C is the protein concentration (M).
Two extinction coefficient values are calculated by ProtParam: the first is based on the assumption that all cysteine residues appear as half cystines, and the second assumes that no cysteines appear as half cystines. The computation has been demonstrated to be quite reliable for proteins that contain Trp residues, but for proteins without Trp residues there may be more than a 10% error.
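The two values and the absorbance relation can be sketched as follows, using the residue coefficients given above. This is an illustration rather than ProtParam's code. The function names are assumptions, and so is the handling of an odd number of cysteines: here every two cysteines form one cystine and any leftover cysteine is treated as unpaired.

```python
EXT_TYR = 1490      # per tyrosine
EXT_TRP = 5500      # per tryptophan
EXT_CYSTINE = 125   # per cystine (one disulfide, i.e. two half cystines)


def extinction_coefficients(sequence):
    """Return (all Cys paired as cystines, no cystines) in M^-1 cm^-1."""
    seq = sequence.upper()
    base = seq.count("Y") * EXT_TYR + seq.count("W") * EXT_TRP
    cystines = seq.count("C") // 2  # assumption: an odd Cys stays unpaired
    return base + cystines * EXT_CYSTINE, base


def absorbance(extinction, path_cm, concentration_molar):
    """Absorb(Prot) = E x l x C."""
    return extinction * path_cm * concentration_molar


oxidized, reduced = extinction_coefficients("WWYCC")
print(oxidized, reduced)
print(absorbance(reduced, path_cm=1.0, concentration_molar=1e-5))
```

For the toy sequence WWYCC, the two values are 2 × 5500 + 1490 + 125 = 12615 and 2 × 5500 + 1490 = 12490.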
These calculations are based on the method developed by Edelhoch, 1967, using extinction coefficients for Trp and Tyr, as determined by Pace et al., 1995. The values used in the calculation of extinction coefficients for denatured proteins were also found to be accurate for calculating coefficients for the native protein (Gill and von Hippel, 1989). In general, since Trp residues contribute much more to the overall extinction coefficient than Tyr and cystine residues, the calculations tend to be much closer to measured values for proteins that contain Trp residues.
The Atomic Composition Table displays the composition of the protein, with respect to the number of atoms of carbon, hydrogen, nitrogen, oxygen, and sulfur that it contains as well as the total number of atoms and the resulting formula.
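Atom counts of this kind follow from summing the elemental composition of each residue and adding one water molecule for the free N- and C-termini. The sketch below illustrates that bookkeeping under simplifying assumptions: an unmodified, linear chain with no disulfide bonds. The function names are hypothetical, and in the formula string every element present is followed by its count, even when the count is 1.

```python
# Atoms per residue within a peptide chain (free amino acid minus H2O),
# given as (C, H, N, O, S).
RESIDUE_ATOMS = {
    "A": (3, 5, 1, 1, 0), "R": (6, 12, 4, 1, 0), "N": (4, 6, 2, 2, 0),
    "D": (4, 5, 1, 3, 0), "C": (3, 5, 1, 1, 1), "Q": (5, 8, 2, 2, 0),
    "E": (5, 7, 1, 3, 0), "G": (2, 3, 1, 1, 0), "H": (6, 7, 3, 1, 0),
    "I": (6, 11, 1, 1, 0), "L": (6, 11, 1, 1, 0), "K": (6, 12, 2, 1, 0),
    "M": (5, 9, 1, 1, 1), "F": (9, 9, 1, 1, 0), "P": (5, 7, 1, 1, 0),
    "S": (3, 5, 1, 2, 0), "T": (4, 7, 1, 2, 0), "W": (11, 10, 2, 1, 0),
    "Y": (9, 9, 1, 2, 0), "V": (5, 9, 1, 1, 0),
}
ELEMENTS = ("C", "H", "N", "O", "S")


def atomic_composition(sequence):
    """Return a dict of atom counts for an unmodified linear peptide."""
    totals = [0, 2, 0, 1, 0]  # start with one H2O for the chain termini
    for residue in sequence.upper():
        for i, count in enumerate(RESIDUE_ATOMS[residue]):
            totals[i] += count
    return dict(zip(ELEMENTS, totals))


def molecular_formula(composition):
    """Build a formula string such as C8H16N2O3S2, skipping absent elements."""
    return "".join(f"{el}{composition[el]}" for el in ELEMENTS if composition[el])


composition = atomic_composition("MC")
print(composition, sum(composition.values()), molecular_formula(composition))
```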
This section of the page provides access to a compendium of Saccharomyces cerevisiae sequence entries for alleles and strains that are located in various external databases including GenBank/EMBL/DDBJ, NCBI, EBI, and MIPS. Sequence entries are listed by accession and/or version numbers according to the source. Additional information is available in the All Associated Sequences help page.
This section provides access to a number of external resources relevant to the query protein. This includes sequence entries located at various homolog related resources, interaction databases, protein databases, and localization resources.
- Homologs: provides access to several sources of homolog information, when available for the requested protein.
- Ashbya (AGD): provides a direct link between the S. cerevisiae protein and the Ashbya gossypii ortholog at the Ashbya Genome Database (AGD) located at the University of Basel.
- AspGD Homologs: provides a direct link between the S. cerevisiae protein and the Aspergillus nidulans ortholog at the Aspergillus Genome Database (AspGD) located at Stanford University.
- CGD Homologs: provides a direct link between the S. cerevisiae protein and the Candida albicans ortholog at the Candida Genome Database (CGD) located at Stanford University.
- Fungal Orthogroups Repository: provides a direct link between the S. cerevisiae protein and its fungal orthologs at the Fungal Orthogroups Repository located at the Broad Institute.
- P-POD: provides a direct link between the S. cerevisiae protein and its protein orthologs at the Princeton Protein Orthology Database located at Princeton University.
- PDB Homologs: provides a link to an SGD page listing homologs of the current S. cerevisiae protein that have 3D structures available at the RCSB Protein Data Bank.
- PhylomeDB: provides a link between the S. cerevisiae protein and its phylogenetic tree as provided by PhylomeDB at CRG in Barcelona.
- PomBase: provides a direct link between the S. cerevisiae protein and the Schizosaccharomyces pombe ortholog at the PomBase database located at the University of Cambridge.
- YGOB (Yeast Gene Order Browser): a tool used to visualize the syntenic context of protein coding genes from S. cerevisiae, S. castellii, C. glabrata, A. gossypii, K. lactis, K. waltii, and S. kluyveri. YGOB was developed by Kevin Byrne and Ken Wolfe (Trinity College, Dublin, Ireland), as described in Byrne and Wolfe.
- YOGY: the eukarYotic OrtholoGY (YOGY) tool is used to view orthologous proteins from eukaryotic organisms (Homo sapiens, Mus musculus, Rattus norvegicus, Arabidopsis thaliana, Drosophila melanogaster, Caenorhabditis elegans, Plasmodium falciparum, Schizosaccharomyces pombe, and Saccharomyces cerevisiae). YOGY provides information from KOGs, Inparanoid, Homologene, OrthoMCL, and manually curated orthologs between S. cerevisiae and S. pombe. YOGY was developed by the Fission Yeast Functional Genomics Team at the Wellcome Trust Sanger Institute, Cambridge, UK.
- BLASTP (NCBI): a direct link to BLASTP at NCBI, facilitates the comparison of the amino acid sequence of the query protein to all non-redundant GenBank CDS translations + PDB + SwissProt + PIR + PRF excluding environmental samples.
- Protein databases/Other
These links provide access to structural assignments for protein sequences at the superfamily level from the SCOP Superfamily resource, mass spectrometry data at GPMdb, information on the protein from the Munich Information Center for Protein Sequences (MIPS) generated by entering the ORF name into the MIPS database search program, Pfam domains from the Wellcome Trust Sanger Institute, and YeastRC Structure Prediction from the YRC Public Data Repository.
- Localization Resources
These links provide access to external databases that contain localization data for many yeast proteins including YPL+ at the University of Graz, Austria, and the Yeast GFP Fusion Localization Database originally at the University of California, San Francisco, but hosted at SGD.
- Post-translational Modifications
All data presented in tables on the Protein page can be downloaded by clicking the download button at the bottom left of each table. All data are downloaded in tab-delimited text format, except the sequence data which is provided in FASTA format.
All of the data displayed on the Protein page, plus additional data, are available from YeastMine. You can search for and download data, or create gene lists and analyze them further using additional YeastMine queries.
YeastMine templates (pre-composed queries) for protein data include:
| Template | Description |
|---|---|
| Retrieve --> All Proteins | Retrieve length, molecular weight, and calculated pI for all proteins encoded by S. cerevisiae genes. |
| Gene --> Protein Sequence | Retrieve the protein sequence for a specified gene. |
| Domain --> Proteins | Retrieve proteins/genes that have a given domain. |
| Retrieve --> Proteins in a molecular weight range | Retrieve genes that encode proteins of the selected molecular weight range. |
| Retrieve --> Proteins in a pI range | Retrieve genes of a selected feature type that encode proteins of the selected calculated protein pI range. |