SGD Help: Protein Information

The Protein Information page provides information on all protein-coding ORFs in the Saccharomyces cerevisiae genome. This page contains locus specific nomenclature and gene product information, a brief description of the role of the gene product within the cell, the predicted primary sequence, basic information derived from the sequence and a proteome viewer for enhanced visualization of sequence specific features. Summary links provide access to detailed prediction-based and manually curated referenced information, as well as links to various external resources. Subtabs, located just below the main Protein tab at the top of the page, provide access to more detailed domain/motif information and physico-chemical properties.

Contents

  1. Protein Information
    1. Basic Information
    2. Proteome Browser
    3. Summary Links
    4. Amino Acid Sequence
    5. External Links
  2. Domains/Motifs and Signal Peptides
    1. Proteome Browser Thumbnails
    2. Shared Domains/Motifs
    3. Unique Domains/Motifs
    4. Transmembrane Domains
    5. Signal Peptides
    6. External Links
  3. Protein Physical Properties
    1. Sequence Based Calculations
    2. Coding Region Translation Calculations

Protein Information

Basic Information

The Basic Protein Information section lists the protein nomenclature first, followed by several descriptive fields and basic properties of the protein of interest.

  • Standard Name: reflects the standard locus name given to a gene by members of the scientific community, based on the SGD nomenclature guidelines in the following format: relevant gene symbol, non-italic, initial letter uppercase, with the suffix 'p' appended.
  • Systematic Name: reflects the location-based systematic name given to the ORF during the genome sequencing project in the following format: relevant gene symbol, non-italic, initial letter uppercase, with the suffix 'p' appended.
  • Alias Name: reflects the alias name given to a gene published under multiple names using the following format: relevant gene symbol, non-italic, initial letter uppercase, with the suffix 'p' appended.
  • Reserved Name: reflects the soon to be published reserved gene name registered with SGD in accordance with the gene-naming guidelines in the following format: relevant gene symbol, non-italic, initial letter uppercase, with the suffix 'p' appended.
  • ORF classification: designation based on the feature qualifier (verified, uncharacterized, or dubious), that indicates the current degree of certainty that an ORF encodes a functional gene product.
  • Description: a concise summary of the biological role and molecular function of the protein and/or gene.
  • Name Description: contains the expanded form of the standard name, as described in the literature.
  • Experimental Data: currently contains the number of molecules/cell, calculated using GFP fusion proteins and quantitative western blot analysis by Ghaemmaghami et al. (2003), as displayed on the Yeast GFP Fusion Localization Database website.
  • Predicted Sequence: contains a link to the GCG-formatted amino acid sequence displayed lower on the page and a button to download the sequence in FASTA format.
  • Length (a.a.): the predicted full length of the translated gene product, calculated using GCG's PEPTIDESORT.
  • Molecular Weight (Da): the predicted molecular weight of the full length protein in daltons (Da), calculated using GCG's PEPTIDESORT.
  • Isoelectric Point (pI): the theoretical pH at which the protein carries no net charge, calculated using GCG's PEPTIDESORT.

Proteome Browser

To aid in the visualization of primary sequence-based protein information, an interactive Proteome Browser has been developed. The graphical image on the Protein Information page (see figure below) is a thumbnail image from the Proteome Browser. Clicking on the thumbnail provides access to the interactive Proteome Browser. This browser is a customized version of GBrowse, a genome browser developed by the Generic Model Organism Database (GMOD) project. The Proteome Browser consolidates the display of domains/motifs (predicted by software and datasets assembled by the InterPro database, using InterProScan), transmembrane domains (predicted using TMHMM), signal peptides (identified using SignalP), profile hits (using BlastProDom and ProfileScan, methods based on the generation of profiles from a family of related sequences derived through multiple sequence alignments), and Kyte-Doolittle hydropathy plots.

In both the thumbnail and the interactive Proteome Browser, HMM domains have been color coded based on the source of the prediction, with PIR SUPERFAMILY domains in red, PFAM domains in orange and yellow, GENE3D domains in purple, PANTHER domains in green, TIGRFAM domains in blue and SMART domains in brown. In the Proteome Browser, a mouseover feature has been added to provide additional detailed information regarding the feature of interest. For example, mousing over a domain will provide details concerning the database origin of the domain match, the name and description of the domain, as well as the E-value of the match.

To view a different protein, first click on the thumbnail image to open the Proteome Browser. Then enter the name in the landmark or region text box. The scroll/zoom feature can be used to modify the region of the protein shown in the default view. The default setting displays the predicted full-length protein, and the zoom option can be used to look at a particular region in more detail (zooming in). Note that one cannot zoom out. Tracks shown on the default view can be modified by selecting/deselecting the tracks of interest and then updating the image. User defined tracks of information can also be displayed by simply uploading the file of interest. Additional information concerning the functionality of the proteome browser can be obtained in the general GBrowse help document since the underlying code and functionality of the two viewers are the same.

Summary Links

This section contains summary statements relevant to the information type listed, as well as links to other resources relevant to the protein of interest.

  • Post-translational Modifications: provides links to the PhosphoGRID and Phosphopep databases.
  • Domains/Motifs: contains a link to a table that summarizes information about domains/motifs that are located in the query protein and shared with other yeast proteins. The table also contains a list of domains/motifs contained in the InterPro database for these other S. cerevisiae proteins, but not found in the original query protein sequence. See the Domains/Motifs section (below) for more details.
  • Transmembrane Domains: provides a summary statement regarding the number of transmembrane domains predicted for the query protein. The TMHMM software uses a hidden Markov model (HMM) to model and predict the location and orientation of transmembrane domains in the query protein. See the Domains/Motifs section (below) for more details.
  • Signal Peptides: provides a summary statement regarding the number of signal peptides predicted for the query protein based on the signal sequences identified by the SignalP software. SignalP uses neural networks and hidden Markov models (HMM) to model and predict signal peptides. See the Domains/Motifs section (below) for more details.
  • Physical Interactions: provides a summary link to the complete list of curated physical interactions between the query protein and other yeast proteins, organized according to the technique used to identify them. Each curated physical interaction includes information on which protein was used as bait and which was the hit, as well as the source of data, the interaction type (type of experiment used to identify the interaction) and the associated reference. Note: this number reflects the total number of reported interactions, some of which may differ only by the technique used to identify the interaction or the published work from which they were extracted. See the Interactions help page for more details.
  • Homologs: provides links to several internal resources that can be used to identify proteins with sequence similarity to the query protein. This includes the following tools:

    • PDB homologs presents information on proteins of known structure with sequence similarity to the query protein.
    • BLASTP uses the Basic Local Alignment Search Tool to compare the amino acid sequence of the query protein against S. cerevisiae sequence datasets.
    • BLASTP v. fungi (fungal BLAST search) uses the Basic Local Alignment Search Tool to compare the amino acid sequence of the query protein against multiple fungal protein sequence datasets.
    • Fungal Alignment displays the alignments between the amino acid sequence of the query protein and the sequences of orthologs from several closely related, sensu stricto and sensu lato species of Saccharomyces.
    • Synteny Viewer displays the degrees of synteny shared among chromosomes in closely related Saccharomyces species (S. paradoxus, S. mikatae and S. bayanus).
  • External Sequence Databases: provides access to a compendium of Saccharomyces cerevisiae sequence entries for alleles and strains that are located in various external databases including GenBank/EMBL/DDBJ, NCBI, EBI, and MIPS. Sequence entries are listed by accession and/or version numbers according to the source. Additional information is available in the All Associated Sequences help page.
  • External Classifications: lists assignments to the Enzyme Commission (EC) and/or the Transporter Classification (TC) numbers. EC assignments were made by UniProtKB/Swiss-Prot; TC assignments were made as part of the Yeast Transporter Information (YETI) project at Genolevures (De Hertogh B, et al. (2006)).

Amino Acid Sequence

The GCG-formatted amino acid sequence is displayed in 50-residue blocks. Residues are numbered on the left side. Also included is a button to Download the sequence, which loads a flat-text browser page with the amino acid displayed in FASTA format.

External Links

This section provides access to a number of external resources relevant to the query protein. This includes sequence entries located at various homolog related resources, interaction databases, protein databases, and localization resources.

  • Homologs: provides access to several sources of homolog information, when available for the requested protein.

    • BLASTP (NCBI): a direct link to BLASTP at NCBI, facilitates the comparison of the amino acid sequence of the query protein to all non-redundant GenBank CDS translations + PDB + SwissProt + PIR + PRF excluding environmental samples.
    • Ashbya (AGD): provides a direct link between the S. cerevisiae protein and the Ashbya gossypii ortholog at the Ashbya Genome Database (AGD) located at the University of Basel.
    • Candida (CGD): provides a direct link between the S. cerevisiae protein and the Candida albicans ortholog at the Candida Genome Database (CGD) located at Stanford University.
    • Candida (CandidaDB): provides a direct link between the S. cerevisiae protein and homolog from Candida albicans at CandidaDB, located at the Institut Pasteur.
    • YGOB (Yeast Gene Order Browser): a tool used to visualize the syntenic context of protein coding genes from S. cerevisiae, S. castellii, C. glabrata, A. gossypii, K. lactis, K. waltii, and S. kluyveri. YGOB was developed by Kevin Byrne and Ken Wolfe (Trinity College, Dublin, Ireland), as described in Byrne and Wolfe.
    • YOGY: the eukarYotic OrtholoGY (YOGY) tool is used to view orthologous proteins from eukaryotic organisms (Homo sapiens, Mus musculus, Rattus norvegicus, Arabidopsis thaliana, Drosophila melanogaster, Caenorhabditis elegans, Plasmodium falciparum, Schizosaccharomyces pombe, and Saccharomyces cerevisiae). YOGY provides information from KOGs, Inparanoid, Homologene, OrthoMCL, and manually curated orthologs between S. cerevisiae and S. pombe. YOGY was developed by the Fission Yeast Functional Genomics Team at the Wellcome Trust Sanger Institute, Cambridge, UK.
  • Interaction Resources

    These links provide access to several external databases containing both genetic and physical interaction data including: BioGRID, the Biomolecular Object Network Database (BOND), BioPIXIE, CYC2008, the Complexome Database, the Database of Interacting Proteins (DIP) at UCLA, and GeneMANIA.

  • Protein databases/Other

    These links provide access to information on structural assignments to protein sequences at the superfamily level using the SCOP Superfamily, links to mass spec. data at GPMdb, information on the protein from the Munich Information Center for Protein Sequences (MIPS) generated by entering the ORF name into MIPS database search program, Pfam domains from the Wellcome Trust Sanger Institute, and YeastRC Structure Prediction from the YRC Public Data Repository.

  • Localization Resources

    These links provide access to external databases that contain localization data for many yeast proteins including OrganelleDB at the University of Michigan, and the YRC Public Image Repository at the University of Washington.

Domains/Motifs and Signal Peptides

The Domains/Motifs and Signal Peptides page displays sequence-based predictive information for protein-coding ORFs in S. cerevisiae. This page contains sections for the display of InterPro-derived, shared and unique domains, motifs, protein family HMMs, TMHMM-derived transmembrane domains and SignalP-derived signal peptides. Up to three Proteome Browser thumbnails may be present to aid in the visualization of sequence-based predictions including domains/motifs, transmembrane domains and signal peptides. InterProScan-derived shared and unique domains/motifs are also presented in tabular form, as are the coordinates for predicted transmembrane domains and signal peptides. Finally, an External Links section is provided so that external databases can be searched directly for protein specific domain/motif information.

Proteome Browser Thumbnails

To aid in the visualization of primary sequence-based protein information, an interactive Proteome Browser has been developed. The graphical image on the Protein Information page (see figure below) is a thumbnail image from the Proteome Browser. Clicking on the thumbnail provides access to the interactive Proteome Browser. This browser is a customized version of GBrowse, a genome browser developed by the Generic Model Organism Database (GMOD) project. The Proteome Browser consolidates the display of domains/motifs (predicted by software and datasets assembled by the InterPro database, using InterProScan), transmembrane domains (predicted using TMHMM), signal peptides (identified using SignalP), profile hits (using BlastProDom and ProfileScan, methods based on the generation of profiles from a family of related sequences derived through multiple sequence alignments), and Kyte-Doolittle hydropathy plots.

In both the thumbnail and the interactive Proteome Browser, HMM domains have been color coded based on the source of the prediction, with PIR SUPERFAMILY domains in red, PFAM domains in orange and yellow, GENE3D domains in purple, PANTHER domains in green, TIGRFAM domains in blue and SMART domains in brown. In the Proteome Browser, a mouseover feature has been added to provide additional detailed information regarding the feature of interest. For example, mousing over a domain will provide details concerning the database origin of the domain match, the name and description of the domain, as well as the E-value of the match.

A second Proteome Browser thumbnail may be present in the transmembrane domain section if such domains have been identified by TMHMM. This thumbnail displays the relative location of the predicted transmembrane domains with specific amino acid coordinates listed in the associated table. Finally, a third thumbnail will be present if signal peptides have been predicted for the protein of interest. This thumbnail displays the relative location of signal peptide(s) with specific amino acid coordinates listed in the associated table.

To view a different protein, first click on the thumbnail image to open the Proteome Browser. Then enter the name in the landmark or region text box. The scroll/zoom feature can be used to modify the region of the protein shown in the default view. The default setting displays the predicted full-length protein, and the zoom option can be used to look at a particular region in more detail (zooming in). Note that one cannot zoom out. Tracks shown on the default view can be modified by selecting/deselecting the tracks of interest and then updating the image. User defined tracks of information can also be displayed by simply uploading the file of interest. Additional information concerning the functionality of the proteome browser can be obtained in the general GBrowse help document since the underlying code and functionality of the two viewers are the same.

Shared Domains/Motifs

The table in the shared domains/motifs section provides information about other Saccharomyces cerevisiae proteins that also contain the Domains, Motifs and protein family HMMs identified in the original Query Protein sequence. The following image of the Shared Domains table is an excerpt of the results produced using Hxt1p as the query sequence. The first column of this table provides the name of other S. cerevisiae proteins with links to the respective Protein Information pages. The middle column shows a list of all Domains/Motifs that are found in both the original Query Protein sequence (e.g. Hxt1p) and also in another S. cerevisiae protein sequence (e.g. YBR241Cp or Mal31p or Git1p). The third column shows a list of Domains, Motifs and protein family HMMs found in the InterPro database for these other S. cerevisiae proteins, (e.g. Git1p), but not found in the original Query Protein sequence. Clicking on the accession number takes you to a page describing the specific domain, motif or protein family HMM at the database of origin.

The results displayed in this section of the page were derived by comparing yeast protein sequences using the InterProScan program (Quevillon E et al. (2005)). Briefly, InterProScan is a tool that combines different protein signature recognition methods into one resource. The InterPro database integrates motif, domain and protein family HMM information from the following member databases: PROSITE, PRINTS, Pfam, ProDom, SMART, TIGRFAMs, Gene3D, PANTHER and PIR SUPERFAMILY. Scanning methods and cut-offs recommended by the member databases are used in InterProScan.

Unique Domains/Motifs

The table displayed in this section of the page contains a list of domains/motifs that are unique to the query protein (i.e. not shared by other yeast proteins). In the example shown below, a unique domain identified in Cdc28p is listed in the table. The first column of this table contains the database source of the domain, the second column contains the accession number of the unique domain and the third column provides a description of the domain. Clicking on the accession number takes you to a page describing the specific domain/motif at the database of origin.

Transmembrane Domains

Transmembrane domains were predicted using version 2.0 of TMHMM, an application available at The Center for Biological Sequence Analysis at the Technical University of Denmark (DTU). If transmembrane domains have been predicted for the query protein, a table containing both a Proteome Browser thumbnail displaying the relative location of the predicted domains and the specific amino acid coordinates will be present. In the example illustrated below, two transmembrane domains were identified in Air2p, one between amino acids 743 and 765, and a second between amino acids 772 and 794. If no transmembrane domains are predicted to be present, a message will be displayed in place of the table.

Signal Peptides

Signal sequences serve to direct proteins from the cytosol to their destination (ER, mitochondria, etc.). There are two types of such sequences: sorting signal peptides/sequences and signal patches. Signal patch sequences are very difficult to predict and are not displayed on the protein pages at SGD, while the amino acids predicted to encode the sorting signal peptide are indicated. Signal peptides were predicted using version 3.0 of SignalP, an application available at The Center for Biological Sequence Analysis at the Technical University of Denmark (DTU). If signal peptides have been predicted for the query protein, a table containing both a Proteome Browser thumbnail displaying the relative location of the predicted signal peptide and the specific amino acid coordinates will be present. In the example illustrated below, a signal peptide was identified in MF(alpha)1p (alpha factor), located between amino acids 1 and 19. If no signal peptides are predicted, a message will be displayed in place of the table.

Protein Physical Properties

Sequence Based Calculations

  • Amino Acid Composition

    The Amino Acid Composition is based on the primary sequence. The table contains three columns: the first lists both the three- and one- letter designations for the twenty amino acids, the second column lists the number of amino acids present in one molecule, and the third contains the composition expressed as a percentage. Values in this table were calculated using GCG's PEPTIDESORT.
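The three columns described above can be reproduced with a few lines of Python. This is a minimal sketch, not the PEPTIDESORT implementation; the function name `amino_acid_composition` and the example peptide are hypothetical and chosen only for illustration.

```python
from collections import Counter

# One-letter to three-letter codes for the twenty standard amino acids.
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def amino_acid_composition(sequence):
    """Return one row per amino acid: (three-letter, one-letter, count, percent)."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("sequence is empty")
    counts = Counter(sequence)
    length = len(sequence)
    return [
        (three, one, counts[one], 100.0 * counts[one] / length)
        for one, three in THREE_LETTER.items()
    ]


# Hypothetical 10-residue peptide, used only to show the table layout.
for three, one, count, percent in amino_acid_composition("MKTAYIAKQR"):
    if count:
        print(f"{three} ({one})  {count:3d}  {percent:5.1f}%")
```

Percentages are expressed relative to the full sequence length, so residues outside the standard twenty (if any) lower the total below 100%.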

  • Atomic Composition

    The Atomic Composition Table displays the composition of the protein, with respect to the number of atoms of carbon, hydrogen, nitrogen, oxygen, and sulfur that it contains as well as the total number of atoms and the resulting formula. Values in this table were calculated using the ProtParam tool, available at ExPASy.

  • Estimated Half-life

    The estimated half-life is a prediction of the time required for half of a synthesized protein to turn over both in vitro and in vivo. This value is calculated based on Varshavsky's "N-end rule", which predicts protein half-life based on the identity of the N-terminal amino acid residue of a protein (reviewed in Varshavsky, 1996, and Varshavsky, 1997). The N-terminal residue plays an important role in the determination of in vivo protein stability. The ordering of protein half-life was determined by creating a series of ubiquitin-beta-gal fusion proteins in yeast where the identity of the amino-terminal beta-gal residue was varied. When expressed in yeast, the ubiquitin moiety was cleaved, exposing various residues at the N-termini. The half-lives of these proteins varied greatly, from less than 3 minutes to greater than 30 hours, depending on the identity of the residue (Bachmair et al., 1986). Similar experiments were carried out in E. coli and in mammalian reticulocytes (Gonda et al., 1989 and Tobias et al., 1991). Estimated half-lives were calculated using the ProtParam tool, available at ExPASy, and are not applicable for N-terminally modified proteins. Approximate half-lives of proteins in the three systems analyzed are summarized in the following table (taken from Varshavsky, 1997 and Gonda et al., 1989).

    N-end rule and corresponding half-life of X-beta-gal
    
     Residue X     Yeast       E. coli     Mammalian
     Ala           >30 hour    >10 hour    4.4 hour
     Arg           2 min       2 min       1.0 hour
     Asn           3 min       >10 hour    1.4 hour
     Asp           3 min       >10 hour    1.1 hour
     Cys           >30 hour    >10 hour    1.2 hour
     Gln           10 min      >10 hour    0.8 hour
     Glu           30 min      >10 hour    1 hour
     Gly           >30 hour    >10 hour    30 hour
     His           3 min       >10 hour    3.5 hour
     Ile           30 min      >10 hour    20 hour
     Leu           3 min       2 min       5.5 hour
     Lys           3 min       2 min       1.3 hour
     Met           >30 hour    >10 hour    30 hour
     Phe           3 min       2 min       1.1 hour
     Pro           >5 hour     ?           >20 hour
     Ser           >30 hour    >10 hour    1.9 hour
     Thr           >30 hour    >10 hour    7.2 hour
     Trp           3 min       2 min       2.8 hour
     Tyr           10 min      2 min       2.8 hour
     Val           >30 hour    >10 hour    100 hour
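Because the estimate depends only on the first residue, it amounts to a table lookup. The sketch below encodes the values from the table above, keyed by one-letter code; it is an illustration rather than the ProtParam implementation, and the names `N_END_RULE` and `estimated_half_life` are hypothetical.

```python
# Column order matches the table above.
SYSTEMS = ("yeast", "E. coli", "mammalian")

# N-terminal residue -> approximate half-life in each system (copied from the table).
N_END_RULE = {
    "A": (">30 hour", ">10 hour", "4.4 hour"),
    "R": ("2 min", "2 min", "1.0 hour"),
    "N": ("3 min", ">10 hour", "1.4 hour"),
    "D": ("3 min", ">10 hour", "1.1 hour"),
    "C": (">30 hour", ">10 hour", "1.2 hour"),
    "Q": ("10 min", ">10 hour", "0.8 hour"),
    "E": ("30 min", ">10 hour", "1 hour"),
    "G": (">30 hour", ">10 hour", "30 hour"),
    "H": ("3 min", ">10 hour", "3.5 hour"),
    "I": ("30 min", ">10 hour", "20 hour"),
    "L": ("3 min", "2 min", "5.5 hour"),
    "K": ("3 min", "2 min", "1.3 hour"),
    "M": (">30 hour", ">10 hour", "30 hour"),
    "F": ("3 min", "2 min", "1.1 hour"),
    "P": (">5 hour", "?", ">20 hour"),
    "S": (">30 hour", ">10 hour", "1.9 hour"),
    "T": (">30 hour", ">10 hour", "7.2 hour"),
    "W": ("3 min", "2 min", "2.8 hour"),
    "Y": ("10 min", "2 min", "2.8 hour"),
    "V": (">30 hour", ">10 hour", "100 hour"),
}


def estimated_half_life(sequence, system="yeast"):
    """Look up the approximate half-life for the N-terminal residue of a sequence."""
    if not sequence:
        raise ValueError("sequence is empty")
    column = SYSTEMS.index(system)  # raises ValueError for an unknown system
    return N_END_RULE[sequence[0].upper()][column]


# Hypothetical peptide starting with Met.
print(estimated_half_life("MKTAYIAKQR"))
```

The lookup ignores everything after the first residue, which is why the estimate does not apply to N-terminally modified proteins.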

  • Instability Index

    The instability index was developed based on a statistical analysis of 12 unstable and 32 stable proteins (Guruprasad et al., 1990). This analysis revealed the presence of certain dipeptides that occurred with significantly different frequencies between stable and unstable proteins. A dipeptide instability weight value (DIWV) was assigned to each of 400 different dipeptides. These weight values were then used to calculate an instability index (II) defined as:

                       i=L-1
    II = (10/L) * Sum     DIWV(x[i]y[i+1])
                       i=1
    
    where: L is the length of sequence
           DIWV is the instability weight value
           and x[i]y[i+1] is a dipeptide starting at position i.

    Proteins with an instability index less than 40 are predicted to be stable, whereas those with a value greater than 40 are predicted to be unstable.
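The summation above can be written as a short Python function. This is a sketch under stated assumptions: the function takes the DIWV table as an argument, and the two weights in `TOY_DIWV` are made-up placeholders, not the published values from Guruprasad et al. A real calculation needs the full table of 400 dipeptide weights.

```python
def instability_index(sequence, diwv):
    """Compute II = (10/L) * sum of DIWV over all consecutive dipeptides.

    `diwv` maps two-letter dipeptides (e.g. "MK") to instability weight values.
    """
    sequence = sequence.upper()
    length = len(sequence)
    if length < 2:
        raise ValueError("sequence must contain at least one dipeptide")
    # Dipeptides start at positions 1 .. L-1 (0-based indices 0 .. L-2).
    total = sum(diwv[sequence[i:i + 2]] for i in range(length - 1))
    return (10.0 / length) * total


def predicted_stable(index):
    """Apply the threshold from the text: below 40 is predicted to be stable."""
    return index < 40


# Placeholder weights for illustration only; NOT the published DIWV values.
TOY_DIWV = {"MK": 2.0, "KM": -1.0}

index = instability_index("MKMK", TOY_DIWV)
print(index, predicted_stable(index))
```

A missing dipeptide raises `KeyError` rather than silently contributing zero, so an incomplete weight table is noticed immediately.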

  • Extinction Coefficient

    The extinction coefficient (epsilon) is the wavelength-dependent molar absorptivity coefficient with units of M^-1 cm^-1. The extinction coefficient provides an indication of the amount of light that a given protein will absorb at a certain wavelength (usually 280 nm). During protein purification a spectrophotometer can be used to follow the protein of interest if the extinction coefficient is known. The molar extinction coefficient of a protein can be estimated based on its amino acid composition. The extinction coefficient of the native protein in water can be calculated based on the molar extinction coefficient of tyrosine, tryptophan and cystine (cysteine does not absorb much at wavelengths greater than 260 nm while cystine does) using the following equation:

    E(Prot) = Numb(Tyr)*Ext(Tyr) + Numb(Trp)*Ext(Trp) + Numb(Cystine)*Ext(Cystine)
    
    where: Ext(Tyr) = 1490
           Ext(Trp) = 5500
           Ext(Cystine) = 125

    The absorbance (optical density) can then be calculated using the following formula:

    Absorb(Prot) = E x l x C
    
    where: E = extinction coefficient
           l = pathlength (cm)
           C = protein concentration (M)

    Two extinction coefficient values are calculated by ProtParam: the first value is based on the assumption that all cysteine residues appear as half cystines, and the second assumes that no cysteines appear as half cystines. The computation has been demonstrated to be quite reliable for proteins that contain Trp residues, but for proteins without Trp residues there may be more than a 10% error.

    These calculations are based on the method developed by Edelhoch, 1967, using extinction coefficients for Trp and Tyr, as determined by Pace et al., 1995. The values used in the calculation of extinction coefficients for denatured proteins were also found to be accurate for calculating coefficients for the native protein (Gill and von Hippel, 1989). In general, since Trp residues contribute much more to the overall extinction coefficient than Tyr and cystine residues, the calculations tend to be much closer to measured values for proteins that contain Trp residues.
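The two equations above translate directly into code. The following is a minimal sketch rather than the ProtParam source; it assumes that "all cysteines as half cystines" means every pair of Cys residues counts as one cystine, with an unpaired Cys ignored. The function names and the example peptide are hypothetical.

```python
# Molar extinction coefficients (M^-1 cm^-1) as given in the text.
EXT_TYR = 1490
EXT_TRP = 5500
EXT_CYSTINE = 125


def extinction_coefficients(sequence):
    """Return (all Cys paired as cystines, no cystines) in M^-1 cm^-1."""
    sequence = sequence.upper()
    base = sequence.count("Y") * EXT_TYR + sequence.count("W") * EXT_TRP
    # Assumption: two Cys residues form one cystine; an odd Cys is left unpaired.
    cystines = sequence.count("C") // 2
    return base + cystines * EXT_CYSTINE, base


def absorbance(extinction, pathlength_cm, concentration_molar):
    """Absorb(Prot) = E x l x C."""
    return extinction * pathlength_cm * concentration_molar


# Hypothetical peptide with one Trp, one Tyr and two Cys.
with_cystines, without_cystines = extinction_coefficients("MWYCCK")
print(with_cystines, without_cystines)
print(absorbance(with_cystines, pathlength_cm=1.0, concentration_molar=1e-5))
```

Returning both values mirrors the two figures reported on the page, leaving the choice between them to the reader who knows the redox state of the protein.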

  • Aliphatic Index

    The aliphatic index refers to the relative volume of a protein that is occupied by aliphatic side chains (alanine, isoleucine, leucine and valine) and contributes to the increased thermostability observed for globular proteins. The aliphatic index of a protein is calculated according to the following formula (Ikai, 1980):

    Aliphatic index = X(Ala) + a * X(Val) + b * ( X(Ile) + X(Leu) )  
    
    where X(Ala), X(Val), X(Ile), and X(Leu) are mole percent (100 X mole fraction) of alanine,
    valine, isoleucine, and leucine. The coefficients a and b are the relative volume of valine
    side chains (a = 2.9) and of Leu/Ile side chains (b = 3.9) relative to that of alanine side
    chains.
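The aliphatic index formula can be sketched in Python as follows. The coefficients are those given above; the function name `aliphatic_index` and the example peptide are hypothetical, and the code is an illustration rather than the ProtParam implementation.

```python
# Relative side-chain volumes from the formula above.
A_VAL = 2.9       # valine relative to alanine
B_ILE_LEU = 3.9   # isoleucine/leucine relative to alanine


def aliphatic_index(sequence):
    """Aliphatic index = X(Ala) + a * X(Val) + b * (X(Ile) + X(Leu))."""
    sequence = sequence.upper()
    length = len(sequence)
    if length == 0:
        raise ValueError("sequence is empty")

    def mole_percent(residue):
        # Mole percent = 100 x mole fraction of the residue in the sequence.
        return 100.0 * sequence.count(residue) / length

    return (
        mole_percent("A")
        + A_VAL * mole_percent("V")
        + B_ILE_LEU * (mole_percent("I") + mole_percent("L"))
    )


# Hypothetical peptide: Ala, Val, Ile and Leu each make up 20% of the residues.
print(aliphatic_index("AVILAVILGG"))
```

For the example peptide the terms are 20 + 2.9 x 20 + 3.9 x 40, so the result is approximately 234.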

Coding Region Translation Calculations

Values for Codon Bias Index (CBI), Codon Adaptation Index (CAI), Frequency of Optimal Codons (Fop), Hydropathicity of Protein (GRAVY score), and Aromaticity Score (AROMO) are calculated based on the specific genetic code and codon usage of a given organism and organelle. These values were calculated using the CodonW software program written by John Peden.

CodonW analyzes the correspondence between amino acids and codon usage in a set of protein sequences, based on a given genetic code (i.e. that used in the S. cerevisiae nucleus versus that used in its mitochondrion). CodonW was designed to work with any genetic code. Decisions regarding whether an amino acid is synonymous or non-synonymous, the translation of a codon, the number of codons in a codon family, and how many synonyms a codon has are all determined at run time. Seven alternatives to the universal genetic code have been built into the program, including S. cerevisiae chromosomal codon usage and S. cerevisiae mitochondrial codon usage. In SGD, we have used these two built-in options, as appropriate, to perform codon usage-based calculations for chromosomally-encoded or mitochondrially-encoded ORFs. Note that codon usage-based calculations are not currently performed for ORFs present within transposable elements (Ty elements), because the codon usage of transposable element genes differs from that of chromosomal genes (see the CodonW tutorial).
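Two of the values listed above depend only on the translated sequence and are simple to illustrate. The sketch below uses the commonly cited definitions: the GRAVY score as the mean Kyte-Doolittle hydropathy value over all residues, and the aromaticity score as the fraction of residues that are Phe, Tyr or Trp. It is assumed, not verified here, that these match what CodonW computes in every detail, and the function names are hypothetical. The codon-based indices (CBI, CAI, Fop) require reference codon usage data and are not sketched.

```python
# Kyte-Doolittle hydropathy scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

AROMATIC = frozenset("FYW")


def gravy(sequence):
    """Mean hydropathy of the sequence (sum of residue values / length)."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("sequence is empty")
    return sum(KYTE_DOOLITTLE[residue] for residue in sequence) / len(sequence)


def aromaticity(sequence):
    """Fraction of residues that are aromatic (Phe, Tyr or Trp)."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("sequence is empty")
    return sum(residue in AROMATIC for residue in sequence) / len(sequence)


# Hypothetical peptide, for illustration only.
print(gravy("MKTAYIAKQR"), aromaticity("MKTAYIAKQR"))
```

Positive GRAVY values indicate a more hydrophobic protein overall, and negative values a more hydrophilic one.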
