SGD Help: DNA Similarity Data Generation Details
Pairwise Chromosome Similarity View (GDSV) and Genomic Stripe View (GSV) are designed to show areas of similarity, or duplication, between different chromosomes in the yeast genome. They use static files of the locations of sequence similarity for each pair of chromosomes to generate similarity matrix images on the fly.
The data sets used by GDSV and GSV were generated by a search program that reads in the raw genomic DNA sequence data and records a list of sequence locations for similarities that meet a host of criteria. These criteria, described in greater detail below, impose requirements regarding the overall length of each similarity as well as the number and pattern of matches and mismatches within the similarity. The process of searching a matrix for similarities is described below.
To begin the search for similarities, a matrix is set up between two particular chromosomes. One sequence is plotted along one axis of the matrix, the second along the other axis. This matrix is then scanned for similarities. Similarities show up as diagonal lines of matches between the two chromosomes in the matrix. The matrix is scanned twice, once for direct similarities, and a second time for inverted similarities.
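The matrix idea can be sketched in a few lines of Python. This is only an illustration, not the search program itself: the function names are hypothetical, "inverted" is assumed to mean a match against the reverse complement of the second sequence, and a full in-memory matrix is used for clarity even though it would be impractical for whole chromosomes.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def match_matrix(seq_a, seq_b):
    """Cell [i][j] is True when base i of seq_a equals base j of seq_b."""
    return [[a == b for b in seq_b] for a in seq_a]


def diagonal_runs(matrix, min_len):
    """List (row, col, length) for every unbroken diagonal run of matches
    that is at least min_len cells long."""
    runs = []
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if matrix else 0
    for offset in range(-(n_rows - 1), n_cols):
        row, col = max(0, -offset), max(0, offset)
        length = 0
        while row < n_rows and col < n_cols:
            if matrix[row][col]:
                length += 1
            else:
                if length >= min_len:
                    runs.append((row - length, col - length, length))
                length = 0
            row += 1
            col += 1
        if length >= min_len:  # run that reaches the edge of the matrix
            runs.append((row - length, col - length, length))
    return runs


seq_a, seq_b = "ACGTAC", "TTACGT"
# First pass: direct similarities.
direct = diagonal_runs(match_matrix(seq_a, seq_b), min_len=4)
# Second pass: inverted similarities (assumed to be reverse-complement matches).
inverted = diagonal_runs(match_matrix(seq_a, reverse_complement(seq_b)), min_len=4)
print(direct)    # [(0, 2, 4)]
print(inverted)  # [(0, 0, 5)]
```

In the inverted pass the column index refers to the reverse-complemented sequence, so it would need to be mapped back to the original coordinate before being recorded.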
Similarities are recorded for display by GDSV and GSV only if they meet a list of requirements. If a potential similarity fails to meet any of the requirements, it is not recorded and therefore will not be displayed by GDSV or GSV. These requirements specify aspects of a similarity such as minimum overall number of matches or maximum overall number of base changes, insertions, and deletions. They also specify additional features such as minimum number of consecutive matches between any two mismatches, etc. These requirements are outlined in Table 1.
Table 1. Search Program Similarity Requirements
| Requirement | Limit | Description |
| Matches | 15 | Minimum overall matches required in a similarity |
| Mismatches | 5 | Maximum overall mismatches allowed in a similarity |
| Consecutive Matches | 3 | Minimum number of consecutive matches required between mismatches, insertions, or deletions |
| Consecutive Mismatches | 3 | Maximum consecutive mismatches allowed in a similarity |
| Overall Percent | 0.77 | Minimum proportion of any similarity occupied by matches: total matches / ( total matches + total mismatches ) |
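As a rough illustration of how the limits in Table 1 combine, the sketch below checks a candidate similarity written as a string of `=` (match) and `x` (mismatch). The string encoding, the constant names, and the function name are assumptions made for this example. The "Consecutive Matches" rule is interpreted here as applying only to runs of matches that lie between two runs of mismatches, and insertions and deletions are not modelled.

```python
from itertools import groupby

# Limits taken from Table 1.
MIN_MATCHES = 15
MAX_MISMATCHES = 5
MIN_CONSECUTIVE_MATCHES = 3
MAX_CONSECUTIVE_MISMATCHES = 3
MIN_MATCH_FRACTION = 0.77


def meets_requirements(pattern):
    """Return True if a '='/'x' pattern satisfies every limit in Table 1."""
    matches = pattern.count("=")
    mismatches = pattern.count("x")
    if matches + mismatches != len(pattern):
        raise ValueError("pattern may contain only '=' and 'x'")
    if matches < MIN_MATCHES or mismatches > MAX_MISMATCHES:
        return False
    if matches / (matches + mismatches) < MIN_MATCH_FRACTION:
        return False
    # Break the pattern into alternating runs, e.g. "==x=" -> =2, x1, =1.
    runs = [(symbol, len(list(group))) for symbol, group in groupby(pattern)]
    for index, (symbol, length) in enumerate(runs):
        interior = 0 < index < len(runs) - 1
        if symbol == "x" and length > MAX_CONSECUTIVE_MISMATCHES:
            return False
        if symbol == "=" and interior and length < MIN_CONSECUTIVE_MATCHES:
            return False
    return True


print(meets_requirements("=" * 15))                    # True
print(meets_requirements("=" * 14))                    # False: too few matches
print(meets_requirements("=" * 8 + "xxxx" + "=" * 8))  # False: 4 mismatches in a row
```

A candidate must pass every test; failing any single limit is enough to reject it, which mirrors the rule that a similarity failing any requirement is not recorded.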
When the search program scans the matrix, it searches down the diagonals of the matrix for matches and, ultimately, similarities. When it has found a set of consecutive matches that could be a similarity, it follows it until the series of matches ends. At this point, it checks whether the series of matches continues along the same diagonal after a small break, as would be the case if one chromosome has a few mismatches relative to the other at that point. However, this procedure will not find the continuation of the similarity if there is an insertion or deletion on one chromosome relative to the other at that point. To find this type of continuation, the program would have to step off to an adjoining diagonal to look for an additional series of matches. Such a mechanism to step off the diagonal was deliberately not included, in order to keep search speed within reasonable limits for scanning the whole yeast genome.
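The behaviour described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual search program: the function name and its `max_gap` and `min_run` parameters are hypothetical, with defaults borrowed from the consecutive-mismatch and consecutive-match limits in Table 1. The walk stays on a single diagonal, so a short block of mismatches is bridged but an insertion or deletion ends the similarity.

```python
from itertools import groupby


def follow_diagonal(seq_a, seq_b, row, col, max_gap=3, min_run=3):
    """Follow one diagonal starting at (row, col).

    Returns a string with '=' for each match and 'x' for each mismatch.
    A break of up to max_gap mismatches is bridged only if it is followed
    by at least min_run matches on the same diagonal.
    """
    cells = "".join(
        "=" if a == b else "x" for a, b in zip(seq_a[row:], seq_b[col:])
    )
    runs = ["".join(group) for _, group in groupby(cells)]
    if not runs or runs[0][0] != "=":
        return ""  # the starting cell is not a match
    accepted = runs[0]
    # runs alternate: matches, mismatches, matches, ...
    for gap, run in zip(runs[1::2], runs[2::2]):
        if len(gap) > max_gap or len(run) < min_run:
            break
        accepted += gap + run
    return accepted


# A single mismatch is bridged because the walk never leaves the diagonal.
print(follow_diagonal("AAAAACAAAA", "AAAAAGAAAA", 0, 0))  # =====x====

# seq_b carries one extra base (T) after position 6, so the similarity
# continues on the neighbouring diagonal and is found as two pieces.
seq_a = "ACGTACGGATTACA"
seq_b = "ACGTACGTGATTACA"
print(follow_diagonal(seq_a, seq_b, 0, 0))  # =======
print(follow_diagonal(seq_a, seq_b, 7, 8))  # =======
```

In the second example each piece has only seven matches, so a single inserted base is enough to turn one 14-base similarity into two fragments that are each shorter than the original.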
Taking the above into account, it is possible that similarities which
harbor insertions or deletions could be split into two pieces which would,
separately, fail to meet the minimum number of matches. One mechanism
used to counteract this phenomenon was to set the default parameters such
that the program records even relatively small similarities, or ones with
many mismatches. This results in the program being able to independently
locate the various parts of larger similarities that contain insertions or
deletions. These then show up in the viewer image as separate line
segments near one another, allowing meaningful patterns to be picked out
visually. One unfortunate side effect is that the tactic of recording
even relatively small similarities inevitably increases the level of small
noise similarities which occur widely in the matrices.
After an entire matrix is scanned this way for similarities, the program uses another mechanism to help compensate for not stepping off the diagonal at a miss. This mechanism searches for situations where one similarity ends on one diagonal next to where another similarity begins on an immediately adjacent diagonal (as would happen in the case where a similarity on one chromosome has an insertion or deletion of one base pair relative to the other chromosome) and joins any such pairs end to end. All other similarities remain independent.
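A minimal sketch of this joining step is shown below. It assumes each similarity is stored as a hypothetical `(row, col, length)` tuple giving its first cell and its length along the diagonal, and it assumes that "next to" means the second similarity starts one cell further along one axis and two cells further along the other, which is what a single-base insertion or deletion produces. The real program's data structures and exact adjacency test are not documented here.

```python
def ends_next_to(first, second):
    """True if `second` starts on an adjacent diagonal right after `first` ends."""
    row, col, length = first
    end_row, end_col = row + length - 1, col + length - 1
    next_row, next_col, _ = second
    # One skipped base on exactly one of the two sequences.
    return (next_row - end_row, next_col - end_col) in {(1, 2), (2, 1)}


def join_adjacent(segments):
    """Group (row, col, length) segments into end-to-end chains.

    Segments with no adjacent neighbour stay in a chain of their own.
    """
    chains = []
    for segment in sorted(segments):
        for chain in chains:
            if ends_next_to(chain[-1], segment):
                chain.append(segment)
                break
        else:
            chains.append([segment])
    return chains


segments = [(7, 8, 7), (0, 0, 7), (20, 3, 5)]
print(join_adjacent(segments))
# [[(0, 0, 7), (7, 8, 7)], [(20, 3, 5)]]
```

Here the first two segments are joined because the second begins one diagonal over from where the first ends, while the third segment is unrelated and remains independent.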
Once the matrix has been searched, the search program records in a file the locations of all the similarities that met all the criteria. GDSV and GSV then read the locations of similarities from this file and create the graphic images seen in your browser. Similarities that fit the given criteria are shown as diagonal lines indicating the location on each chromosome where the match occurred. Even if a similarity contains mismatches, insertions, or deletions, it will still appear as an uninterrupted solid line in the viewer. Therefore, while a line seen in the viewer might indicate a region of pure similarity, it may also contain up to the specified number of mismatches plus the specified number of insertions or deletions. See Table 1 for the exact number. Blue lines indicate direct repeats and green lines indicate inverted repeats.