SGD Help: GO Term Finder
The Gene Ontology (GO) project was established to provide a common language to describe aspects of a gene product's biology. A gene product's biology is represented by three independent structured, controlled vocabularies: molecular function, biological process and cellular component. For more information on GO, see the SGD GO Help page or the GO home page.
To provide the most detailed information available, gene products are annotated to the most granular GO term(s) possible. For example, if a gene product is localized to the perinuclear space, it will be annotated to that specific term only and not the parent term nucleus. In this example the term perinuclear space is a child of nucleus. However, for many purposes, such as analyzing the results of microarray expression data, it is very useful to "calculate" on GO, moving up the GO tree from the specific terms used to annotate the genes in a list to find GO parent terms that the genes may have in common. The GO Term Finder tool allows you to do this.
The GO Term Finder is described in detail in Boyle et al. (2004).
- Using GO Term Finder
- Method/Algorithm Description
The query page has several options as described below.
- Optional: Add a title for your search results
This is an optional step that allows you to name your analysis and keep track of the results.
- Step 1: Enter your gene(s)
You can either type the gene names in the input box or upload a file that contains the gene names. Note that the program requires more time to process a long list (greater than 100 genes) than a short list.
- Step 2: Choose your ontology
Select one of the three (biological process, molecular function, or cellular component) ontologies by clicking the appropriate radio button. The tool searches only one ontology at a time to minimize the search time.
- Click the Search button after Step 2 to search using the default settings, or go to Steps 3, 4, and 5 to customize your search, as described below.
- Optional Step 3: Specify your background set.
The default background set includes all the features/gene names in the database that have at least one GO annotation. This step allows you to enter or upload a specific list of genes for the background set. You can also customize the background set of genes (default or your specific set) by specifying feature type and/or ORF qualifier.
- Optional Step 4: Refine the Annotations used for Calculations.
You can refine the annotations to genes in your background set using three different criteria: Annotation Method, Annotation Source, and Evidence Code. The effect of refining the annotations is explained with an example in the Results section below. The GO Term Finder at SGD queries Manually curated and High-throughput annotations only and does not query annotations obtained using computational methods.
- Optional Step 5: Select a p-value cutoff for results and/or toggle False Discovery Rate
p-value: The default p-value cut-off is < 0.01. The pull-down allows you to change the cut-off to view hits with a less stringent threshold.
FDR: Check this box if you would like the GO Term Finder program to calculate the False Discovery Rate (FDR) for each node. The FDR is calculated by running 50 simulations with random genes, and counting the average number of times a p-value as good as or better than a p-value generated from the real data is seen. This is used as the numerator. The denominator is the number of p-values in the real data that are as good as or better than it. Thus, instead of setting your cutoff based on p-value, the FDR allows you to choose a cutoff that has an acceptable level of false discovery.
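The FDR calculation described above can be sketched as follows. This is a minimal illustration, not the tool's actual code: the function name `false_discovery_rates` and the sample p-values are hypothetical, and only two simulated runs are used here instead of the 50 the tool performs.

```python
def false_discovery_rates(real_pvalues, simulated_pvalue_sets):
    """Estimate an FDR for each real p-value.

    Numerator: average number of simulated p-values (per simulation)
    that are as good as or better than the real p-value.
    Denominator: number of real p-values as good as or better than it.
    """
    num_sims = len(simulated_pvalue_sets)
    rates = []
    for p in real_pvalues:
        # Count simulated hits across all runs, then average per run.
        simulated_hits = sum(
            sum(1 for q in sim if q <= p) for sim in simulated_pvalue_sets
        )
        avg_false = simulated_hits / num_sims
        # The real p-value always counts itself, so this is at least 1.
        num_real = sum(1 for q in real_pvalues if q <= p)
        rates.append(avg_false / num_real)
    return rates


# Hypothetical data: p-values from the real gene list and from
# two simulations run with randomly chosen genes.
real = [0.001, 0.02, 0.2]
simulations = [[0.5, 0.01], [0.3, 0.6]]
rates = false_discovery_rates(real, simulations)
for p, fdr in zip(real, rates):
    print(f"p-value {p}: estimated FDR {fdr:.3f}")
```

A cutoff can then be chosen by keeping only the terms whose estimated FDR is below an acceptable level.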
The results page displays, in both graphic and table form, the significant shared GO terms (or parents of GO terms) used to describe the set of genes entered on the previous page. In addition, the results page displays all the criteria used to customize the Background set and Annotations in the background set.
The graphic illustrates the relationships among the GO terms used to directly or indirectly describe the genes in your list. The figure below shows a section of the graphic obtained from querying the gene list in the example below.
The color of each box indicates the p value score (see description of the method below). Genes associated with the GO terms are shown in gray boxes. Each GO term links to the SGD GO term page, where you can view the parent-child relationships involving that term as well as other genes annotated to it. Each gene links to its SGD Locus page.
In some cases, the number of GO terms is too large to display on a web page. When this occurs, only the most significant terms are shown. Regardless of the number of significant terms returned, an option to download the complete set of results is always available.
The table below the graph (see example below for a sample table) lists each significant GO term, the number of times the GO term is used to annotate genes in the list (or cluster), and the number of times that the term is used to annotate genes in the background set. The default for the background set is all the genes/features that have at least one GO annotation in the database. As of October 2011, the total number of annotated genes in SGD is 7168 (referred to as N in the Algorithm below) and this includes all Open Reading Frames (Verified, Uncharacterized and Dubious ORFs), tRNAs, rRNAs, repeated ORFs, genes in the mitochondrial genome and genes not present in the systematic sequence of S288C, that have a GO annotation. In addition, the p-value and all the genes annotated, either directly or indirectly, to the term are provided.
To determine the statistical significance of the association of a particular GO term with a group of genes in the list, GO Term Finder calculates the p-value: the probability of seeing at least x genes out of the total n genes in the list annotated to a particular GO term, given the proportion of genes in the whole genome that are annotated to that GO term. That is, the GO terms shared by the genes in the user's list are compared to the background distribution of annotation. The closer the p-value is to zero, the more significant the association of the GO term with the group of genes (i.e., the less likely it is that the observed annotation of the group to that term occurred by chance). A customizable web implementation of the GO Term Finder tool, which uses the same algorithm as SGD's tool and allows the user to set a p-value cut-off, is available at Princeton University.
GO annotations in the background set can be refined to include annotations from a particular GO Annotation Source (for example, SGD is a source), GO Annotation Method and Evidence Codes. Refining annotations will not drop the number of genes in the background set, but will remove genes annotated to a particular term from the calculations. This is explained with the example below. Consider the following set of input genes:
ABP140 ACF2 ACF4 AKL1 APP1 ARC35 ARF3 BBC1 BEM2 BEM4 CDC1 GEA1 GEA2 MSS4 PIN3 SDA1 SFK1 SHE4 SIT4 SLG1 SLM1 SLM2 SSK2 STT4 VIP1 VPS1 WSC2 WSC3
Search results along with the filtering criteria for this input list for Process Ontology are shown below. The first analysis used the default settings (top table) and the second analysis filtered out annotations with the evidence code IGI (bottom table). Only the top significant hits are shown in the tables below to illustrate the point. (Note that GO annotations are continuously updated at SGD. These results were computed in Oct 2011 and may be different at later times.)
Results with default background settings
Results after filtering annotations with IGI evidence codes
Important points to note:
- The number of genes in the background set (7168) is the same in both cases as shown in the column titled 'Background Frequency'.
- The number of genes annotated to the term 'actin cytoskeleton organization' in the background is different in the two cases. Results with the default settings show 108 genes, while the results after filtering IGI annotations show 101 genes. Seven genes annotated to the GO term 'actin cytoskeleton organization' have been filtered out for calculating the p-value for 'actin cytoskeleton organization'.
- Cluster Frequency: This column shows the number of genes in the input list that are annotated to that GO term. This number differs between the two cases (24/28 and 20/28) because once annotations made with the IGI evidence code have been removed from the background set, they are also removed for the genes in the input list; it is not meaningful to calculate the significance of something that is not in the background set.
- p-value: Filtering increases the p-value from 3.81e-39 to 4.98e-30 for the term 'actin cytoskeleton organization'; thus filtering causes a GO term to be less significant for the input list.
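The effect of filtering on the background counts can be illustrated with a toy data set. This is a sketch under stated assumptions: the gene names, the annotation records, and the helper `background_counts` are hypothetical and do not reflect SGD's data or implementation. It shows only that excluding an evidence code leaves the background size unchanged while reducing the number of genes annotated to a term.

```python
# Hypothetical annotation records: (gene, GO term, evidence code).
ANNOTATIONS = [
    ("GENE_A", "actin cytoskeleton organization", "IMP"),
    ("GENE_B", "actin cytoskeleton organization", "IGI"),
    ("GENE_B", "endocytosis", "IDA"),
    ("GENE_C", "actin cytoskeleton organization", "IGI"),
    ("GENE_C", "actin cytoskeleton organization", "IPI"),
]


def background_counts(annotations, term, excluded_codes=frozenset()):
    """Return (background size, genes annotated to term after filtering)."""
    # The background is every gene with at least one annotation;
    # filtering does not remove genes from it.
    background = {gene for gene, _, _ in annotations}
    # Drop only the annotations made with an excluded evidence code.
    kept = [a for a in annotations if a[2] not in excluded_codes]
    annotated = {gene for gene, t, _ in kept if t == term}
    return len(background), len(annotated)


term = "actin cytoskeleton organization"
default_counts = background_counts(ANNOTATIONS, term)
filtered_counts = background_counts(ANNOTATIONS, term, excluded_codes={"IGI"})
print("default :", default_counts)
print("filtered:", filtered_counts)
```

In this toy set, GENE_B is annotated to the term only through IGI evidence, so it no longer counts toward the term after filtering, while GENE_C still counts because of its IPI annotation.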
When including results from this tool in a publication, consider the following:
- GO annotations are continuously updated at SGD. As a result, one might not be able to reproduce a given set of results on a different date. Therefore it can be important to mention the date the analysis was done in the publication.
- It can be useful to mention details of the background set, including the number of genes it contains, as well as the p-value cut-off used.
Genes are directly associated with GO terms that are as granular as possible. Because the GO terms have hierarchical relationships with each other, genes are also considered to be indirectly associated with all the parents of the granular terms to which they are directly associated.
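Direct and indirect association can be sketched as a walk up the parent links of the ontology. The hierarchy below is a deliberately simplified toy (the real GO graph has more intermediate terms and multiple parents per term), and the gene names and function names are hypothetical.

```python
from collections import defaultdict

# Toy child -> parents map; a simplified assumption, not the real GO graph.
PARENTS = {
    "perinuclear space": ["nucleus"],
    "nucleus": ["cellular_component"],
    "cellular_component": [],
}


def ancestors(term, parents=PARENTS):
    """Return the term itself plus every term reachable via parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def propagate(direct_annotations, parents=PARENTS):
    """Map each term to the genes annotated to it directly or indirectly."""
    genes_by_term = defaultdict(set)
    for gene, terms in direct_annotations.items():
        for term in terms:
            for t in ancestors(term, parents):
                genes_by_term[t].add(gene)
    return dict(genes_by_term)


# Hypothetical direct annotations.
direct = {"GENE_A": ["perinuclear space"], "GENE_B": ["nucleus"]}
indirect = propagate(direct)
for term, genes in sorted(indirect.items()):
    print(term, sorted(genes))
```

Here GENE_A is directly annotated only to perinuclear space, yet it also counts toward nucleus, which is what lets two genes with different granular terms share a parent term.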
The tool looks for significant shared GO terms that are directly or indirectly associated with the genes in the list. To determine significance, the algorithm examines the group of genes to find GO terms to which a high proportion of the genes are associated as compared to the number of times that term is associated with other genes in the genome. For example, when searching the process ontology, if all of the genes in a group were associated with "DNA repair", this term would be significant. However, since all genes in the genome (with GO annotations) are indirectly associated with the top level term "biological_process", it would not be significant if all the genes in a group were associated with this very high level term.
Note that this version of GO Term Finder uses a hypergeometric distribution with Multiple Hypothesis Correction (i.e., Bonferroni Correction) to calculate p-values. A stand-alone, generic version of GO Term Finder that uses a hypergeometric distribution, with Bonferroni Correction and False Discovery Rate, can be downloaded here.
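A hypergeometric p-value with Bonferroni correction can be sketched as below. This is an illustration rather than the tool's code: the function names are hypothetical, and the number of hypotheses tested (`num_terms_tested`) is an assumed placeholder because the source does not state it. The counts come from the 'actin cytoskeleton organization' example above (24 of 28 list genes, 108 of 7168 background genes; 20 and 101 after filtering IGI annotations).

```python
from math import comb


def hypergeom_tail(x, n, G, N):
    """P(at least x annotated genes) when n genes are drawn without
    replacement from N background genes, G of which are annotated."""
    upper = min(n, G)
    favourable = sum(
        comb(G, k) * comb(N - G, n - k) for k in range(x, upper + 1)
    )
    return favourable / comb(N, n)


def bonferroni(p, num_tests):
    """Multiply by the number of hypotheses tested, capped at 1."""
    return min(1.0, p * num_tests)


raw_default = hypergeom_tail(24, 28, 108, 7168)
raw_filtered = hypergeom_tail(20, 28, 101, 7168)

# Assumption: the number of GO terms tested is not given in the text,
# so this value is a placeholder for illustration only.
num_terms_tested = 100
print("default :", bonferroni(raw_default, num_terms_tested))
print("filtered:", bonferroni(raw_filtered, num_terms_tested))
```

Because the correction factor here is assumed, the printed values will not match the published table exactly; the sketch only shows that the filtered case gives a larger (less significant) p-value.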
If G is the number of genes annotated to a term (either directly or indirectly) and N is the total number of genes in the genome with GO annotations (see Results Table section above for details on this number), then p, the probability of a randomly selected gene being annotated to a particular GO term, can be calculated as:
$$p = \frac{G}{N}$$
Given a list of n genes, in which x of them have been annotated to a given GO term (directly or indirectly), the probability of having x out of n annotations assigned to the same GO term by chance is defined as the product of the number of combinations by which the annotations can occur and the following expression:
$$p^{x} \, (1-p)^{n-x}$$
Within a list of n genes, there are multiple combinations by which x of them may have this annotation. The number of combinations can be calculated as:
$$\binom{n}{x} = \frac{n!}{x!\,(n-x)!}$$
However, annotations to a particular term are low probability events (p is small). Because of this, any list of genes having a particular set of annotations is likely to have a low probability, but not necessarily a significant one. Thus, instead of calculating the probability of having exactly x of n genes annotated to a term, a more conservative approach, often used by statisticians, is taken: calculating the probability of x or more of n genes being annotated to a particular term. Since GO annotations are still incomplete (i.e., there may be more than x genes annotated to a particular term), this is appropriate. This is calculated as:
$$\text{p-value} = \sum_{j=x}^{n} \binom{n}{j} \, p^{j} \, (1-p)^{n-j}$$
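The probability of x or more of n genes being annotated to a term, built from the quantities defined above, can be sketched as follows. The function name `binomial_tail` and the sample numbers are hypothetical; this illustrates the binomial form described in this section, not the tool's own code.

```python
from math import comb


def binomial_tail(x, n, G, N):
    """Probability that x or more of n genes are annotated to a term,
    where each gene is annotated with probability p = G / N."""
    p = G / N
    return sum(
        comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(x, n + 1)
    )


# Hypothetical numbers: 3 of 10 list genes annotated to a term that
# covers 50 of 5000 background genes.
print(binomial_tail(3, 10, 50, 5000))
```

Summing from x up to n, rather than evaluating only the term for exactly x, is what makes the estimate conservative in the sense described above.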