SGD Help: GO Term Finder
The Gene Ontology (GO) project was established to provide a common
language to describe aspects of a gene product's biology. A gene
product's biology is represented by three independent structured,
controlled vocabularies: molecular function, biological process and
cellular component. For more information on GO, see the SGD
GO Tutorial, the SGD
GO Help page, or the GO
consortium home page.
To provide the most detailed information available, gene products are
annotated to the most granular GO term(s) possible. For example, if a
gene product is localized to the perinuclear space, it will be
annotated to that specific term only and not the parent term
nucleus. In this example the term perinuclear space is a child of nucleus.
However, for many purposes, such as analyzing the results of
microarray expression data, it is very useful to "calculate" on GO,
moving up the GO tree from the specific terms used to annotate the
genes in a list to find GO parent terms that the genes may have in
common. The GO Term Finder tool allows you to do this.
The GO Term Finder is described in detail in Boyle et al (2004).
The query page has several options as described below.
- Step 1: Enter your gene(s)
You can either type the gene names in the input box or upload a file
that contains the gene names. Note that the program takes longer to
process a long list (more than 100 genes) than a short one.
- Step 2: Choose your ontology
Select one of the three ontologies (biological process, molecular
function, or cellular component) by clicking the appropriate radio
button. To keep search time short, this tool searches only one of the
three ontologies at a time.
- Click the Search button after Step 2 to search
using the default settings or go to Steps 3 and 4 to specify and customize
your background set and/or refine the annotations in your background
set.
- Step 3: Specify your background set. This is an optional step
that allows you to specify a background set of genes. The default
background set includes all the features/gene names in the database
that have at least one GO annotation.
You can also customize the background set of genes (default or
your specific set) by specifying feature type and/or feature qualifier.
- Step 4: Refine the Annotations used for Calculations.
This is also an optional step and allows you to refine the annotations
to genes in your background set using three different criteria.
The effect of refining the annotations is explained with an example in the
next section.
The GO Term Finder at SGD queries only manually curated and high-throughput
annotations; it does not query annotations obtained using
computational methods.
- Step 5: Select a p-value cutoff for results.
This is another optional step that allows you to view hits with a less
stringent p-value cut-off. The default cut-off is p < 0.01.
The results page displays, in both graphic and table form, the
significant shared GO terms (or parents of GO terms) used to describe
the set of genes entered on the previous page. In addition, the
results page displays all the criteria used to customize the
Background set and Annotations in the background set.
Graphic Display
The graphic illustrates the relationships
among the GO terms used to directly or indirectly describe the genes
in your list. The color of each box indicates the p-value (see the description
of the method below). Genes associated with the
GO terms are shown in gray boxes. Each GO term links to the SGD GO
term page, where you can view the GO structure around that term as
well as other genes associated with it. Each gene links to its SGD
Locus page.
In some cases, the number of GO terms is too large to
display on a web page. When this occurs, only the most significant terms
are shown. Regardless of the number of significant terms returned, an
option to download the complete set of results is always available.
To generate the graphics, the program uses the GraphViz Perl module from CPAN,
a wrapper around AT&T's Graphviz tool.
Results Table
The table below the graph lists each significant GO term, the number
of times the GO term is used to annotate genes in the list (or
cluster) and the number of times that the term is used to annotate
genes in the background set. The default background set is
all the genes/features that have at least one GO
annotation in the database. As of March 2008, the total number of annotated
genes in SGD is 7155 (referred to as N in the Algorithm below). This number
includes every feature with a GO annotation: all Open Reading Frames
(Verified, Uncharacterized and Dubious ORFs),
tRNAs, rRNAs, repeated ORFs, genes in the mitochondrial genome and
genes not present in the systematic sequence of S288C. In addition,
the table provides the p-value and all the genes annotated, either directly
or indirectly, to the term.
To determine the statistical significance of the association of a
particular GO term with a group of genes in the list, GO Term Finder
calculates the p-value: the probability of seeing at least x of the
n genes in the list annotated to a particular GO term, given the
proportion of genes in the whole genome that are annotated to that
GO term. That is, the GO terms shared by the genes in the user's list
are compared to the background distribution of annotation. The closer
the p-value is to zero, the more significant the association of the
GO term with the group of genes (i.e. the less likely it is that the
observed annotation of the GO term to the group of genes occurred by
chance).
A customizable web
implementation of the GO Term Finder tool that
allows the user to set a p-value cut-off (which uses the same
algorithm as the SGD's tool) is available at Princeton University.
Effect of refining Annotations in the background set
GO annotations in the background set can be refined to include
annotations from a particular GO Annotation Source (for
example, SGD is a
source), GO Annotation Method, and Evidence Codes. Refining annotations does
not reduce the number of genes in the background set, but it removes
genes annotated to a particular term from the calculations. This is
explained with the example below.
Consider the following set of input genes:
ABP140
ACF2
ACF4
AKL1
APP1
ARC35
ARF3
BBC1
BEM2
BEM4
CDC1
GEA1
GEA2
MSS4
PIN3
SDA1
SFK1
SHE4
SIT4
SLG1
SLM1
SLM2
SSK2
STT4
VIP1
VPS1
WSC2
WSC3
Search results for this input list in the Process Ontology, along with the
filtering criteria, are shown below: one table with the default settings
(top) and one after filtering out annotations with the evidence code IGI
(bottom). Only the top 3 significant hits are shown in each table to
illustrate the point.
Important values to note are:
- The number of genes in the background set (7292) is the same in both
cases as shown in the column titled 'Background Frequency'.
- The number of genes annotated to the term 'actin cytoskeleton
organization and biogenesis' in the background differs between the
two cases. Results with the default settings show 104 genes annotated
to this term, while the results after filtering annotations show 98.
Six genes annotated to the term have been filtered out of the P-value
calculation for 'actin cytoskeleton organization and biogenesis'.
- Cluster Frequency: The number of genes reported in this column
reflects how many genes in your input list are annotated to that GO
term. This number differs between the two cases (28/28 and 23/28)
because once the genes annotated using the IGI evidence code have been
removed from the term's background count, they are also removed from
the input list's count for that term; it is not meaningful to calculate
the significance of something that is not in the background set.
- P-value: Filtering increases the P-value from 7.84e-52
to 1.01e-37 for the term 'actin cytoskeleton
organization and biogenesis'; thus filtering makes the GO term less
significant for the input list.
Note: GO annotations are continuously updated at SGD. These
results were computed in Sept 2006 and may not be reproducible at
later times.
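The effect of filtering by evidence code can be illustrated with a small sketch. The gene names, term names, and the helper functions `genes_per_term` and `frequencies` below are hypothetical toy values chosen for illustration; they are not SGD data and not the tool's own code. The sketch only shows how removing annotations with one evidence code can lower both the cluster count and the background count for a term, while a gene that also has an annotation with another evidence code stays counted.

```python
from collections import defaultdict

# Hypothetical toy annotations: (gene, GO term, evidence code).
ANNOTATIONS = [
    ("GENE_A", "term_1", "IDA"),
    ("GENE_A", "term_1", "IGI"),
    ("GENE_B", "term_1", "IGI"),
    ("GENE_C", "term_1", "IMP"),
    ("GENE_D", "term_2", "IDA"),
]


def genes_per_term(annotations, excluded_codes=frozenset()):
    """Map each term to the genes annotated to it, skipping excluded evidence codes."""
    result = defaultdict(set)
    for gene, term, code in annotations:
        if code not in excluded_codes:
            result[term].add(gene)
    return dict(result)


def frequencies(term, input_genes, annotations, excluded_codes=frozenset()):
    """Return (cluster count, background count) for one term."""
    annotated = genes_per_term(annotations, excluded_codes).get(term, set())
    return len(annotated & set(input_genes)), len(annotated)


input_genes = ["GENE_A", "GENE_B"]
default_counts = frequencies("term_1", input_genes, ANNOTATIONS)
filtered_counts = frequencies("term_1", input_genes, ANNOTATIONS, {"IGI"})
print("default: ", default_counts)
print("filtered:", filtered_counts)
```

In this toy data, GENE_B is annotated to term_1 only through IGI, so it drops out of both counts after filtering, whereas GENE_A remains because of its IDA annotation. The size of the background set itself is not modeled here, since filtering annotations does not change it.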
Results with default background settings
Background gene set: Default
7292 genes based on the following filtering criteria:
Feature Type(s) included: ORF, ncRNA, not in systematic sequence of S288C, not physically mapped, pseudogene, rRNA, snRNA, snoRNA, tRNA, transposable_element_gene
Feature Qualifier(s) included: Dubious, Uncharacterized, Verified
Annotations: Default
Annotation Source(s) included: SGD
Annotation Method(s) included: Manually curated, high-throughput
Evidence Code(s) included: RCA, NR, ND, NAS, TAS, IGI, IC, IDA, IEP, IPI, ISS, IEA, IMP
| Terms from the Process Ontology |
| Gene Ontology term | Cluster frequency | Background frequency | P-value | Genes annotated to the term |
| actin cytoskeleton organization and biogenesis | 28 out of 28 genes, 100.0% | 104 out of 7292 background genes, 1.4% | 7.84e-52 | ABP140, ACF2, ACF4, AKL1, APP1, ARC35, ARF3, BBC1, BEM2, BEM4, CDC1, GEA1, GEA2, MSS4, PIN3, SDA1, SFK1, SHE4, SIT4, SLG1, SLM1, SLM2, SSK2, STT4, VIP1, VPS1, WSC2, WSC3 |
| actin filament-based process | 28 out of 28 genes, 100.0% | 108 out of 7292 background genes, 1.5% | 2.65e-51 | ABP140, ACF2, ACF4, AKL1, APP1, ARC35, ARF3, BBC1, BEM2, BEM4, CDC1, GEA1, GEA2, MSS4, PIN3, SDA1, SFK1, SHE4, SIT4, SLG1, SLM1, SLM2, SSK2, STT4, VIP1, VPS1, WSC2, WSC3 |
| cytoskeleton organization and biogenesis | 28 out of 28 genes, 100.0% | 215 out of 7292 background genes, 2.9% | 4.64e-42 | ABP140, ACF2, ACF4, AKL1, APP1, ARC35, ARF3, BBC1, BEM2, BEM4, CDC1, GEA1, GEA2, MSS4, PIN3, SDA1, SFK1, SHE4, SIT4, SLG1, SLM1, SLM2, SSK2, STT4, VIP1, VPS1, WSC2, WSC3 |
Results after filtering annotations with IGI evidence codes
Background gene set: Default
7292 genes based on the following filtering criteria:
Feature Type(s) included: ORF, ncRNA, not in systematic sequence of S288C, not physically mapped, pseudogene, rRNA, snRNA, snoRNA, tRNA, transposable_element_gene
Feature Qualifier(s) included: Dubious, Uncharacterized, Verified
Annotations: Custom
Annotation Source(s) included: SGD
Annotation Method(s) included: Manually curated, high-throughput
Evidence Code(s) included: RCA, NR, ND, NAS, TAS, IC, IDA, IEP, IPI, ISS, IEA, IMP
| Terms from the Process Ontology |
| Gene Ontology term | Cluster frequency | Background frequency | P-value | Genes annotated to the term |
| actin filament-based process | 23 out of 28 genes, 82.1% | 98 out of 7292 background genes, 1.3% | 1.01e-37 | ABP140, ACF2, ACF4, AKL1, APP1, ARC35, BBC1, BEM4, CDC1, MSS4, PIN3, SDA1, SFK1, SHE4, SIT4, SLG1, SLM1, SLM2, SSK2, STT4, VPS1, WSC2, WSC3 |
| actin cytoskeleton organization and biogenesis | 23 out of 28 genes, 82.1% | 98 out of 7292 background genes, 1.3% | 1.01e-37 | ABP140, ACF2, ACF4, AKL1, APP1, ARC35, BBC1, BEM4, CDC1, MSS4, PIN3, SDA1, SFK1, SHE4, SIT4, SLG1, SLM1, SLM2, SSK2, STT4, VPS1, WSC2, WSC3 |
| cytoskeleton organization and biogenesis | 23 out of 28 genes, 82.1% | 200 out of 7292 background genes, 2.7% | 5.63e-30 | ABP140, ACF2, ACF4, AKL1, APP1, ARC35, BBC1, BEM4, CDC1, MSS4, PIN3, SDA1, SFK1, SHE4, SIT4, SLG1, SLM1, SLM2, SSK2, STT4, VPS1, WSC2, WSC3 |
Here are some important points to note when including results from
this tool in a publication.
- GO annotations are continuously updated at SGD. As a result,
one might not be able to reproduce a given set of results on a different
date. Mentioning the date when the analysis was done in the
publication can be useful.
- Mentioning details of the background set, including the number of
genes in
the background set and p-value cut-off used can be useful.
Genes are directly associated with GO terms that are as granular as
possible. Because the GO terms have hierarchical relationships with
each other, genes are also considered to be indirectly associated with
all the parents of the granular terms to which they are directly
associated.
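The idea of indirect association can be sketched in a few lines of Python. The hierarchy below is a hypothetical, heavily simplified toy (it is not the real GO graph), and the function names `ancestors` and `all_associations` are made up for this illustration. The sketch walks parent links upward from a directly annotated term to collect every term the gene is indirectly associated with.

```python
# Hypothetical toy hierarchy: child term -> list of parent terms.
PARENTS = {
    "perinuclear space": ["nucleus"],
    "nucleus": ["cellular_component"],
    "cellular_component": [],
}


def ancestors(term, parents):
    """Return every term reachable by following parent links upward from `term`."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def all_associations(direct_terms, parents):
    """Return the direct terms plus all terms they are indirectly associated with."""
    associated = set(direct_terms)
    for term in direct_terms:
        associated |= ancestors(term, parents)
    return associated


indirect = ancestors("perinuclear space", PARENTS)
combined = all_associations(["perinuclear space"], PARENTS)
print(sorted(indirect))
print(sorted(combined))
```

A term can have more than one parent in GO, which is why the toy structure maps each term to a list of parents and the traversal tracks the terms it has already visited.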
The tool looks for significant shared GO terms that are directly or
indirectly associated with the genes in the list. To determine
significance, the algorithm examines the group of genes to find GO
terms to which a high proportion of the genes are associated as compared
to the number of times that term is associated with other genes in the
genome. For example, when searching the process ontology, if all of
the genes in a group were associated with "DNA repair", this term
would be significant. However, since all genes in the genome (with GO
annotations) are indirectly associated with the top level term
"biological_process", it would not be significant if all the genes in
a group were associated with this very high level term.
Notes: This version of GO Term Finder uses a hypergeometric
distribution with Multiple Hypothesis Correction (i.e., Bonferroni
Correction) to calculate p-values. A stand-alone,
generic version of GO Term Finder that uses a hypergeometric
distribution, with Bonferroni Correction and False Discovery Rate, can be downloaded here.
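As a rough illustration of the approach named in the note above, the following Python sketch computes a hypergeometric tail probability and applies a Bonferroni correction. It is a minimal sketch under stated assumptions, not the tool's implementation: the function names `hypergeom_tail` and `bonferroni` are hypothetical, and the number of hypotheses tested is not given in this document, so `num_tests` below is an arbitrary placeholder. The printed values are therefore not expected to match the tables above.

```python
from math import comb


def hypergeom_tail(x, n, G, N):
    """P(X >= x) when n genes are drawn without replacement from N genes,
    G of which are annotated to the term."""
    upper = min(n, G)
    favorable = sum(comb(G, k) * comb(N - G, n - k) for k in range(x, upper + 1))
    return favorable / comb(N, n)


def bonferroni(p, num_tests):
    """Multiply the p-value by the number of tests, capped at 1."""
    return min(1.0, p * num_tests)


# Counts taken from the example tables (cluster and background frequencies).
default_raw = hypergeom_tail(28, 28, 104, 7292)
filtered_raw = hypergeom_tail(23, 28, 98, 7292)

num_tests = 100  # placeholder assumption; the real number of tested terms is not stated
print("default, uncorrected: ", default_raw)
print("filtered, uncorrected:", filtered_raw)
print("default, corrected:   ", bonferroni(default_raw, num_tests))
```

Exact integer arithmetic with `math.comb` is used so that very small probabilities are not lost to rounding before the final division.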
Algorithm Details:
If G is the number of genes annotated to a term (either directly or
indirectly) and N is the total number of genes in the genome with GO
annotations (please see Results Table section above for details on
this number), then p, the probability of a randomly selected gene being
annotated to a particular GO term can be calculated as:
$$p = \frac{G}{N}$$
Given a list of n genes, in which x of them have been annotated to a
given GO term (directly or indirectly), the probability of having x
out of n annotations assigned to the same GO term by chance is defined
as the product of the number of ways in which the annotations
can occur and the following expression:

$$p^{x}\,(1-p)^{n-x}$$

Within a list of n genes, there are multiple ways (combinations) in which x
of them may have this annotation. The number of combinations can be
calculated as:

$$\frac{n!}{x!\,(n-x)!}$$
However, annotations to a particular term are low probability events
(p is small). Because of this, any list of genes having a particular
set of annotations is likely to have a low probability, but not
necessarily a significant one. Thus, instead of calculating the
probability of having x of n genes annotated to a term, a more
conservative approach, often used by statisticians, is taken to
calculate the probability of x or more of n genes being annotated to a
particular term. Since GO annotations are still incomplete (i.e. there
may be more than x genes annotated to a particular term), this is
appropriate. This is calculated as:

$$\sum_{j=x}^{n} \frac{n!}{j!\,(n-j)!}\; p^{j}\,(1-p)^{n-j}$$
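The tail probability described in this section (the chance of x or more of n genes being annotated to a term, with p = G/N) can be sketched in Python as follows. This is an illustrative sketch of the binomial form laid out above, with a hypothetical function name `binomial_tail` and toy input values; it is not taken from the tool's source.

```python
from math import comb


def binomial_tail(x, n, G, N):
    """Probability that x or more of n genes are annotated to a term,
    where each gene is annotated with probability p = G / N."""
    p = G / N
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(x, n + 1))


# Toy values: half the genome is annotated to the term, and the list has 2 genes.
toy = binomial_tail(1, 2, 1, 2)
print(toy)
```

Summing from x up to n, rather than evaluating the single term for exactly x, is what makes the estimate the more conservative "x or more" probability.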
A more current and detailed description of the algorithm can be
downloaded from CPAN.
Thanks to Gavin Sherlock at SMD
for the description of the algorithm.