In this section, we outline the algorithm used to compute the composition of database sequences and to apply composition-based statistics in TBLASTN. Then we further describe the tests reported in this paper: the executables used, the tests sets, and details about the methods.
Compositional adjustment in TBLASTN
The BLAST heuristics[2] use a general scoring system, such as the PAM[38, 39] or BLOSUM[10] series of matrices, to discover database sequences likely to align to the query and likely starting points for alignments. In BLAST, an alignment is known as a high-scoring pair, or HSP[40]. A list of HSPs for each significant query-subject pair is created using a multi-stage algorithm. At each stage, HSPs may be culled from the current list for a number of reasons, including having insufficiently high score, being contained in a higher-scoring HSP, or sharing an endpoint with a higher-scoring HSP. As a result, while each successive stage of the BLAST algorithm requires significantly more computation for each HSP, fewer HSPs need be considered.
Compositional adjustment, whether used by TBLASTN or other modes of operation, is applied only in the final stage of a BLAST search. In this fashion, modes that use compositional adjustment apply the fast heuristics of BLAST to locate regions likely to contain, and starting points likely to lead to, high-scoring alignments. They apply compositional adjustment only before the most sensitive and most computationally expensive alignment algorithm, the computation of a gapped alignment that includes information specifying the locations of gaps, information known as the "traceback". The list of HSPs produced by this final gapped alignment, after being filtered for insufficiently significant or redundant HSPs, is the list presented to the user.
Steps for applying compositional adjustment
At a high level, the steps of composition adjustment, applied individually to each query-subject pair, are as follows: (1) compute windows of interest using the list of HSPs from preliminary stages of the BLAST algorithm; (2) obtain translated subject data for the windows and filter it to remove uninteresting subsequences; (3) compute the composition of the subject region for each HSP to be realigned; (4) compute a scoring matrix for each HSP to be realigned, based on the composition of the subject region of that HSP and on the composition of the query; (5) perform a gapped alignment with traceback to recompute the list of HSPs, using the new scoring matrices. In practice, these high-level steps are interleaved to reduce memory requirements.
Computing windows of interest
For each match between the query and a subject sequence, the compositional adjustment algorithm is given a separate list of HSPs. Each HSP specifies, along with other information, a range in the subject sequence that has been aligned to the query. These ranges are used as follows to compute a list of windows. First, a preliminary list of windows for the subject sequence is created. This list contains one window for each HSP, the window that surrounds the subject range of the HSP, including 600 bases to the left and right of the subject range if that much sequence data is available. Then a final list of windows is created by joining windows in the same translation frame if they touch or overlap. For each window, a list of HSPs corresponding to the window is maintained.
Obtaining and filtering subject data
The nucleotide subject data within a window is obtained and translated using that window's translation frame. The SEG[9] algorithm with window size 10, low-cutoff 1.8, and high cutoff 2.1 is used to mask low-complexity regions in the subject window. The parameters were chosen as a result of the study[4]. A low-complexity region is typically dominated by a few distinct residues often, but not always, in a repetitive pattern. Typical examples are polyglycine or polyproline monomers. Alignment scores that include the scores of low-complexity regions tend to overstate the significance of the alignments and lead to many false positive matches.
The effect of applying the SEG algorithm to an amino acid sequence is to replace each residue in a low-complexity region with the X character: a character that is assigned a small negative score when aligned to any character, including itself. The subject data are filtered before compositionally adjusted scoring matrices are computed, and occurrences of the X character are ignored when computing the composition of a sequence. Unlike the composition-adjustment code, the preliminary stages of the BLAST search do not filter the subject data.
SEG filtering may also be applied to the query sequence. SEG filtering of the query is a command-line option for both BLASTP and TBLASTN. The programs differ in that SEG filtering of the query is off by default in BLASTP but on by default in TBLASTN. We did not filter the query in any results reported in this paper. The SEG parameters used to filter the subject sequence apply a higher threshold for declaring a region to be low-complexity than do the default parameters used to filter the query. The reason that the query sequence is more stringently filtered is that the query sequence is used at every stage of the BLAST algorithm. SEG filtering of the subject only occurs at the final stages of a BLAST search, and under-filtering the data within a subject window will effect only a single comparison.
Computing the composition of the subject
For TBLASTN, the sequence data and the subject ranges of the HSPs within a window are used to determine a range likely to contain correctly translated amino acid data. The window is searched strictly to the left of the subject range of the HSP to find the rightmost occurrence of a stop codon. If one is found, then the location 20 characters to the right of the stop codon is the left boundary of the composition range, with the restriction that the entire subject range of the HSP be included. If no stop codon is found, then the left endpoint is the left endpoint of the window. The symmetric rule is applied to the right.
The intent is not necessarily to locate the stop codon that terminates the protein, but rather to use the presence of a stop codon to indicate that the hypothetically translated codon lies in a noncoding region. Indeed, the noncoding region may be an intron rather than the true end of the amino acid sequence. Because we are not attempting to find a terminating stop codon, we propose a symmetric rule to determine the sequence range to use for composition adjustment even though biological translation is asymmetric.
In a random DNA sequence with 50% GC content, one would expect to find a stop codon in a hypothetically translated amino acid sequence on average once every 21 characters. Therefore, we institute a 20 character margin between the stop codon and the range to use for composition adjustment, with the restriction that the entire subject range of the HSP be included.
Given a particular region, TBLASTN considers only the 20 standard amino acids when computing composition; the X character, the stop character, and all other nonstandard characters are completely ignored. When the length of the sequence is used in the compositional adjustment algorithms, the value used does not count occurrences of ignored characters.
Computing compositionally-adjusted scoring matrices
Schäffer et al.[4] and Yu et al.[5] show how to adjust substitution scores for the 20 standard amino acids. For the standard amino acids, we apply those techniques. These papers do not, however, discuss the treatment of rarely occurring amino acids, two-letter ambiguity characters, the X character, or the stop character. We discuss the treatment of the X and stop characters in this section, because they occur commonly in TBLASTN searches. We discuss the treatment of the other characters in Additional file 3.
The stop character occurs frequently in translated sequences and occasionally within significant alignments. An occurrence of the stop character usually indicates that one is translating a noncoding region or a coding region in the wrong frame. Of course, a stop character can also simply mark the end of translation. However, stop characters occur within significant alignments for several reasons: the subject sequence may contain a pseudogene; the subject sequence may be mitochondrial DNA, in which certain codons that are stop codons in nuclear DNA are translated to true amino acids [41–43]; the subject sequence may contain a stop codon that is converted in vivo to a selenocysteine[44, 45] or pyrrolysine[46] residue; the subject sequence may represent a gene, such as the hdc gene in D. melanogaster[47, 48], that encodes a protein product by mRNA readthrough; or there may be a sequencing error in the subject sequence.
Appropriate scoring of the stop character is essential to TBLASTN. Any character aligned to a stop character should be given a negative score, but not a negative score of such large magnitude as to disallow valid alignments containing a stop codon. BLAST uniformly assigns letters aligned to a stop codon an integral score that, given the scale being used, is as close as possible to -2 bits.
As just discussed, biologically meaningful and statistically significant TBLASTN alignments may sometimes contain translated stop codons. However, the presence of many stop codons in noncoding regions and out-of-frame coding regions renders it very unlikely that these regions will yield high-scoring alignments by chance. Accordingly, for E-value calculations, TBLASTN assumes the length of a database sequence to be the length of the protein yielded by translation in a single reading frame, even though translation is in fact performed in all six reading frames. That many database DNA sequences are noncoding over much of their lengths may be one explanation for the generally conservative statistics of S-TBLASTN and C-TBLASTN shown in Figure 1.
Because of the application of the SEG algorithm, the X ambiguity character is common, and the treatment of X characters can significantly effect the performance of the algorithm. We score alignments with X as follows. When either compositional matrix scaling or compositional matrix adjustment is used, substitution scores are computed for all standard amino acids. Then, for all variants tested here, the score of aligning a standard amino acid i in the query sequence to an X in the subject is computed using the formula:
where is the set of standard amino acids and is the probability of amino acid j in the subject sequence. In other words, the score of matching a standard amino acid with X is the expected value over all matches of that amino acid with a standard amino acid, provided that this value is less than -1. For B-TBLASTN and S-TBLASTN, is the actual frequency of the amino acid in the subject region; for C-TBLASTN, the probabilities are computed using pseudocounts, as described in [6]. A formula analogous to Equation(1) is used to compute the score of aligning an X character in the query to a standard amino acid in the subject. The score for aligning X to itself is the smaller of the expected score of aligning any two standard amino acids and -1, rounded to the nearest integer.
Performing a gapped alignment with traceback
Routines that apply composition-based statistics do not merely rescore alignments, but rather recompute them. Alignments are computed using one of two techniques. By default, the x-drop algorithm [2, 49] is applied to a set of starting points specified in the lists of HSPs provided from previous stages of the BLAST algorithm[1, 2]. As a result of modifications made during the course of this project, one may alternately specify that the rigorous Smith-Waterman[50] algorithm be applied within each window. If the x-drop algorithm is applied, the composition is computed individually for each HSP that is realigned. If the Smith-Waterman algorithm is used, the composition of a window is taken to be the composition of its highest-scoring HSP. Pooling the composition of the subject regions of several HSPs within a window is problematic because the HSPs do not necessarily belong to the same alignment, or even to the same linked set of alignments. The default in TBLASTN is to use the x-drop algorithm, and we use the x-drop algorithm in the tests presented in this paper.
The following pseudocode shows how alignments corresponding to a single query-subject match are recomputed when the x-drop algorithm is used. In the pseudocode, q represents the query data, is a list of windows, is a source of sequence data, and params is a structure containing all parameters needed for gapped alignment. The variable represents the new set of alignments to be returned, and M represents a compositionally adjusted scoring matrix. The HSP_IS_CONTAINED and WITH_DISTINCT_ENDS routines will be described below; the action of the remaining routines should be clear from their names.
Algorithm 1
Redo alignments in a window.
function REDO_ONE_WINDOW (q, w, , params, cutoff_score)
←∅
H←windows.hsps
SORT_BY_SCORE(H)
s←GET_TRANSLATED_SUBJECT (w, )
for i←0 to length(H)-1 do
h←H[i]
if forall 0≤ j <i not HSP_IS_CONTAINED(h, H [j]) then
M←ADJUST_COMPOSITION (q, s, h, params)
a←CALC_X_DROP_ALIGNMENT (q, s, h, M, params)
if a.score≥cutoff_score then
←WITH_DISTINCT_ENDS (a, )
end if
end if
end for
return A
end function
The HSP_IS_CONTAINED routine returns true if the HSP provided as its first argument is contained in the HSP provided as its second argument. An HSP is considered to be contained in a second HSP if its query and subject bounds are contained in the query and subject bounds of the second HSP and if the second HSP has equal or higher score.
The WITH_DISTINCT_ENDS routine adds the alignment a to the set of alignments if and only if does not already contain an equal- or higher-scoring alignment that shares an endpoint with a. If a is added to , then WITH_DISTINCT_ENDS filters to remove any lower-scoring alignments that share an endpoint with a. In this fashion, repeatedly calling the routine WITH_DISTINCT_ENDS ensures that the final list of alignments does not contain an alignment that shares an endpoint with a higher-scoring alignment. When two alignments share the same endpoint, the higher-scoring one is the preferred alignment; the lower-scoring alignment is a suboptimal artifact of the BLAST heuristics.
The x-drop algorithm requires a starting point (p
q
, p
s
) that will force an alignment between offset p
q
in the query and p
s
in the subject. It computes an alignment in both directions starting from this point. A starting point is defined for each HSP that is realigned. If possible, the starting point that was originally used to compute the HSP is reused. Due to the effects of SEG filtering and the newly computed scoring matrix, however, the previous starting point may no longer be desirable; it may lie in a region of nonpositive score. We discuss the rule used to validate the existing starting point, and if necessary choose a new one, in Additional file 3: tblastn_suppl.pdf.
Finally, we remark that Algorithm 1 is also correct pseudocode for BLASTP, which performs protein-query, protein-database searches. The difference is that for BLASTP there is only one window for each subject sequence: the window that includes the entire sequence. Moreover, for BLASTP the composition of the entire subject sequence is always used when performing compositional adjustment. Therefore, the compositionally adjusted matrix is necessarily the same for each HSP in a window and need only be computed once. In practice, the same code is used for both TBLASTN and BLASTP to implement Algorithm 1, but for BLASTP a conditional is used to ensure the matrix is only computed once for each window.
Test sets and programs used
We describe below the specific executables, data sets, and methods used to generate the results presented in this paper. The variants of TBLASTN reported here were written in C, and, as noted below, some variants are available as part of the NCBI C and C++ software distributions; the computational modules involved are mirrored between the two distributions. Numerous auxiliary programs used to automate testing and summarize results were written in C, Perl, Python, and Bourne shell script.
Executables used
TBLASTN is a mode of operation for the blastall executable. This executable is available for download from[36]. The C-TBLASTN and S-TBLASTN variants are available as a set of options to the blastall executable. S-TBLASTN is invoked using the command-line options "-p tblastn -F F -C 1". C-TBLASTN is invoked using similar options, but with "-C 1" replaced by "-C 2". B-TBLASTN is not currently available as a set of command line options. TBLASTN may be run without composition-based statistics, by omitting the "-C" option, but the default version runs with lower precision than B-TBLASTN. Executables that run B-TBLASTN and the specific versions of S-TBLASTN and C-TBLASTN used in this paper are available for download at[51].
The blastall executable by default uses BLOSUM62 to perform alignments of amino acid sequences, and this is the matrix used in all stages before composition adjustment is performed. The "-F F" option disables SEG filtering of the query sequence. SEG filtering of the subject sequence is on by default in any of the composition adjustment modes. We consider filtering both sequences to be unnecessary; when we tried filtering both sequences, we saw no improvement in statistical accuracy, but did see a decline in the ROC scores (data not shown).
Tests using randomly permuted queries
To measure how effective composition-based statistics is at eliminating false matches with low E-value, we performed a series of tests using randomly permuted amino acid sequences from the mouse (Mus musculus) genome. One thousand protein sequences were randomly selected from the list of RefSeq[52] mouse proteins current on January 10, 2006. Sequences were permuted using their GenBank identification number as a seed to a random number generator. The permuted sequences are provided as Additional file 1.
We aligned the permuted sequences to a database of chromosomal sequences from the reference assembly of build 35 of the human (Homo sapiens) genome, released August 26, 2004. The database includes chromosomes X and Y and the unplaced sequence fragments included in the build. We omitted the mitochondrial genome from the database, however, as these sequences are known (see[42]) to have a different genetic code than nuclear DNA.
ROC score tests on the yeast genome
To test the effectiveness of various modes of composition adjustment for TBLASTN, we performed a number of tests using the yeast nuclear genome. We downloaded the yeast genome from[53], a site containing reference genomes curated by NCBI staff. The version of the genome that we used was created on May 16, 2005.
We aligned a set of 102 protein domains to the yeast nucleotide genome using TBLASTN. This test set was first developed for the study in [11]. An updated version was used in [4], in which a human curated list of true positive matches to the yeast proteome was used to generate ROC scores. For the tests described here, we updated the true positive list to reflect changes in the published yeast genome. The updated list contains 987 query-subject matches to 894 distinct subject sequences. The version of the test set used in this paper is provided as Additional file 2.
In the yeast genome, each known yeast protein is annotated with the location and strand of its coding region. These annotations allow us to adapt the test set for use with TBLASTN as follows. For TBLASTN, alignments are divided into three categories: (1) alignments that match a query to the coding region of a known true positive match; (2) alignments that match a query to a known coding region that is not a true positive match; and (3) alignments that do not match a known coding region. An alignment is said to match a query to a coding region if the subject portion of the alignment overlaps the coding region and is on the same strand.
It is not uncommon for there to be more than one alignment between a query and a coding region. Indeed this is expected; protein-protein searches also report multiple alignments between pairs of proteins. When there is more than one alignment to a coding region, only the lowest E-value alignment between a particular query and the coding region is used when computing ROC scores. No attempt is made to apply a similar rule to noncoding regions. All alignments that do not overlap a coding region are categorized as false positive matches and counted when computing ROC scores.
We made two explicit exceptions to this scheme for classifying hits. The first exception is to add a particular pseudogene (Entrez Gene ID 850644) to our list of coding regions and to make the pseudogene a true positive for one of our queries, raising the maximum possible number of true positives to 988. Each of the variants tested found an alignment to this pseudogene with E-value smaller than 10-12. The pseudogene is expressed and produces a functional protein under certain conditions [54–56]. Though this region is labeled as a pseudogene, we do not believe an alignment algorithm should be expected to distinguish it from a true gene. The second exception is to categorize a particular alignment that overlaps one true positive coding region and one false positive coding region as a true positive match. This overlap is reported by all three variants of TBLASTN.