Composition-based statistics and translated nucleotide searches: Improving the TBLASTN module of BLAST

Background
TBLASTN is a mode of operation for BLAST that aligns protein sequences to a nucleotide database translated in all six frames. We present the first description of the modern implementation of TBLASTN, focusing on new techniques that were used to implement composition-based statistics for translated nucleotide searches. Composition-based statistics use the composition of the sequences being aligned to generate more accurate E-values, which allows for a more accurate distinction between true and false matches. Until recently, composition-based statistics were available only for protein-protein searches. They are now available as a command line option for recent versions of TBLASTN and as an option for TBLASTN on the NCBI BLAST web server.

Results
We evaluate the statistical and retrieval accuracy of the E-values reported by a baseline version of TBLASTN and by two variants that use different types of composition-based statistics. To test the statistical accuracy of TBLASTN, we ran 1000 searches using scrambled proteins from the mouse genome and a database of human chromosomes. To test retrieval accuracy, we modernize and adapt to translated searches a test set previously used to evaluate the retrieval accuracy of protein-protein searches. We show that composition-based statistics greatly improve the statistical accuracy of TBLASTN, at a small cost to the retrieval accuracy.

Conclusion
TBLASTN is widely used, as it is common to wish to compare proteins to chromosomes or to libraries of mRNAs. Composition-based statistics improve the statistical accuracy, and therefore the reliability, of TBLASTN results. The algorithms used by TBLASTN are not widely known, and some of the most important are reported here. The data used to test TBLASTN are available for download and may be useful in other studies of translated search algorithms.


Supplemental Information
Treatment of nonstandard characters
The amino acid alphabet used by BLAST has 28 characters, most of which may be represented in text by uppercase Roman letters. The alphabet contains characters that correspond to the 20 amino acids present in the known genetic codes. We refer to the remaining eight characters as nonstandard characters. The two nonstandard characters that occur often in TBLASTN searches are the stop character and the X character, and we discuss the treatment of these characters in Methods. The alphabet contains a nonstandard character that represents a gap in an alignment, but this character should not occur in any input sequence to TBLASTN. While none of the remaining five characters may occur in a translated sequence, some may occur in an amino acid sequence used as a query. Therefore, scores for alignments involving these characters must be defined.
Two of the nonstandard characters are true amino acids that, while not represented in any known genetic code, are inserted into certain proteins as part of translation of mRNA. These amino acids are selenocysteine [1,2], represented by the character U, and pyrrolysine [3], represented by the character O. Selenocysteine and pyrrolysine residues are rare but are biologically important when they occur. Because the amino acids are so rare, it is difficult to assign aligned amino acid pairs involving these characters a meaningful score. BLAST chooses to, in effect, filter these characters from sequences; the score for aligning any character to either O or U is precisely the same as the score for aligning the character to X.
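The effect of this filtering can be illustrated with a minimal Python sketch. The helper name `pair_score`, the mapping `FILTERED_AS_X`, and the dictionary layout of the matrix are hypothetical choices made for illustration; they are not the data structures BLAST uses internally.

```python
# Hypothetical mapping: the rare residues U (selenocysteine) and
# O (pyrrolysine) are scored exactly as if they were X.
FILTERED_AS_X = {"U": "X", "O": "X"}


def pair_score(matrix, a, b):
    """Score of aligning characters a and b under `matrix`.

    `matrix` is assumed to be a dict keyed by (query_char, subject_char)
    that already defines scores involving X.
    """
    a = FILTERED_AS_X.get(a, a)
    b = FILTERED_AS_X.get(b, b)
    return matrix[(a, b)]


# Toy scores for illustration only; these are not BLOSUM values.
toy = {("A", "A"): 4, ("A", "X"): -1, ("X", "A"): -1, ("X", "X"): -1}
print(pair_score(toy, "A", "U") == pair_score(toy, "A", "X"))  # True
```

Mapping the characters at lookup time keeps the matrix itself unchanged, which is one simple way to guarantee that O, U, and X always receive identical scores.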
The remaining three characters represent ambiguity between pairs of amino acids: B represents an ambiguity between aspartic acid and asparagine; J represents an ambiguity between leucine and isoleucine; and Z represents an ambiguity between glutamic acid and glutamine. These characters may occur in the query amino acid sequence if the method used to obtain the sequence could not distinguish between two amino acids. For instance, mass spectrometry cannot distinguish leucine from isoleucine.
S-TBLASTN simply scales the scores for alignments involving two-letter ambiguity characters in exactly the same fashion as it scales the scores of the standard amino acids. C-TBLASTN computes scores for B, J, and Z using the target frequencies and background probabilities computed while optimizing the scores of the standard amino acids. Similar formulas are used for each of B, J, and Z; we only describe the formulas used when calculating scores involving B here.
Let $Q_{ij}$, $P_i$, and $P'_j$ be calculated as in [4] for all standard amino acids i and j. $Q_{ij}$ represents the target frequency of substituting amino acid i in the query with amino acid j in the subject. $P_i$ and $P'_i$ represent the background frequencies of amino acid i in the query and subject sequences, respectively. Let $\lambda$ be a statistical parameter that gives the scale of the scoring system. The score of aligning B in the query to standard amino acid j in the subject is given by equation (S1), in which D represents aspartic acid and N represents asparagine. The score for aligning B in the query with X is computed using equation (S2). If j is one of the characters B, J, or Z, representing an ambiguity between characters k and $\ell$, then score(B, j) is given by equation (S3). BLAST uses integer scores, so the results of equations (S1)-(S3) are rounded to the nearest integer before they are used. Occurrences of B in the subject sequence are treated analogously. We remark that the characters J and O are recent additions to the alphabet used by BLAST and may not be supported in older versions of BLAST or in all modes of operation.
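As a hedged sketch only, the following shows one form that log-odds scores for B could take given the quantities defined above. It assumes that the target frequency and background probability of an ambiguity character are the sums of those of the two residues it represents, and that scores are log-odds ratios divided by $\lambda$. These assumptions are not stated in the text, so the expressions should not be read as the exact equations (S1)-(S3). The first line covers a standard amino acid j in the subject; the second covers a subject character j that is itself an ambiguity between k and $\ell$.

```latex
\begin{align*}
\mathrm{score}(B, j) &= \frac{1}{\lambda}
  \ln \frac{Q_{Dj} + Q_{Nj}}{(P_D + P_N)\, P'_j} \\[4pt]
\mathrm{score}(B, j) &= \frac{1}{\lambda}
  \ln \frac{Q_{Dk} + Q_{D\ell} + Q_{Nk} + Q_{N\ell}}
           {(P_D + P_N)\,(P'_k + P'_\ell)}
\end{align*}
```

Under these assumptions the second expression reduces to the first when k and $\ell$ are replaced by a single residue j, which is the consistency one would expect between the two cases.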
Determining a starting point for a gapped alignment
All variants of TBLASTN discussed in this paper apply SEG filtering to the translated subject sequence before applying compositional adjustment. Earlier stages of the BLAST algorithm, and in particular the stage that generates starting points for gapped alignments, do not filter the subject sequence. Once the subject sequence has been filtered, a starting point may no longer be desirable because it lies in a region that has been overwritten with Xs. Furthermore, compositional adjustment itself can affect the quality of a given starting point. Therefore, TBLASTN must test each existing starting point to determine whether it is an acceptable location to start a gapped alignment, and if not must compute a new starting point.
To determine whether a starting point from previous stages of the BLAST algorithm is acceptable, TBLASTN uses the compositionally adjusted scoring system to calculate the score of an ungapped alignment that contains the starting point. Usually, the ungapped alignment extends five positions to the left and five positions to the right of the starting point, but it may be shorter if there is not enough sequence data to extend that far. If the score of this ungapped alignment is positive, then the starting point is acceptable and is used to recompute a gapped alignment.
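The acceptance test can be sketched in Python as follows. The function names, the dictionary-based matrix, and the toy scores are hypothetical. The sketch also assumes that the window is clipped at the ends of the sequences, which is one reading of "not enough sequence data"; the text does not say exactly where the window is truncated.

```python
def ungapped_score(M, q, s, i, j, n):
    """Score of the ungapped alignment of length n starting at (q[i], s[j])."""
    return sum(M[(q[i + k], s[j + k])] for k in range(n))


def is_acceptable_start(M, q, s, qi, sj, half_width=5):
    """Return True if the window around (qi, sj) has a positive score."""
    # Clip the window at the sequence ends (an assumption, see above).
    left = min(half_width, qi, sj)
    right = min(half_width, len(q) - 1 - qi, len(s) - 1 - sj)
    n = left + right + 1
    return ungapped_score(M, q, s, qi - left, sj - left, n) > 0


# Toy scoring system: +2 for identical non-X residues, -1 otherwise.
M = {(a, b): (2 if a == b != "X" else -1) for a in "ACX" for b in "ACX"}

print(is_acceptable_start(M, "ACACACACACAC", "ACACACACACAC", 6, 6))  # True
print(is_acceptable_start(M, "ACACACACACAC", "XXXXXXXXXXXX", 6, 6))  # False
```

The second call mimics a starting point that falls in a region overwritten with Xs by filtering: every aligned pair scores negatively, so the starting point is rejected.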
If the starting point is not acceptable, then TBLASTN computes a new starting point using the algorithm outlined in the following pseudocode. In the pseudocode, q and s represent the query and subject data, respectively. The interval $[\ell_q, r_q]$ is the range of a particular HSP in the query, and $[\ell_s, r_s]$ is its range in the subject. M is a compositionally adjusted scoring matrix and ungapped_score(M, q, s, i, j, n) is a function that computes the score of an ungapped alignment of length n starting at (q[i], s[j]).
Algorithm 2. Find a starting point for gapped alignment.
    function find_gapped_start(M, q, ℓ_q, r_q, s, ℓ_s, r_s)
        max_score ← 0; max_index ← 0
        length ← min{r_q − ℓ_q + 1, r_s − ℓ_s + 1}
        for i ← 0 to length − 11 do
            S ← ungapped_score(M, q, s, i + ℓ_q, i + ℓ_s, 11)
            if S > max_score then
                max_score ← S
                max_index ← i + 5
            end if
        end for
        return (max_index + ℓ_q, max_index + ℓ_s)
    end function

Algorithm 2 computes the score of several ungapped alignments of length 11, and if any of these has positive score, then it chooses the midpoint of the highest scoring alignment to be a new starting point for gapped alignment. It can, and in practice does, happen that none of the alignments tested has positive score. In this case, the left endpoint of the HSP itself is used as the starting point for gapped alignment.
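Algorithm 2 translates almost line for line into Python. The sketch below is illustrative rather than the BLAST source: the dictionary-based matrix, the `window` parameter (fixed at 11 in the pseudocode), and the toy scores are assumptions made for the example.

```python
def ungapped_score(M, q, s, i, j, n):
    """Score of the ungapped alignment of length n starting at (q[i], s[j])."""
    return sum(M[(q[i + k], s[j + k])] for k in range(n))


def find_gapped_start(M, q, lq, rq, s, ls, rs, window=11):
    """Return a (query, subject) starting point within the HSP.

    [lq, rq] and [ls, rs] are the inclusive HSP ranges in q and s.
    """
    max_score = 0
    max_index = 0
    length = min(rq - lq + 1, rs - ls + 1)
    # i runs over 0 .. length - window inclusive, as in the pseudocode.
    for i in range(length - window + 1):
        score = ungapped_score(M, q, s, i + lq, i + ls, window)
        if score > max_score:
            max_score = score
            max_index = i + window // 2  # midpoint of the window
    # If no window scored positively, max_index is still 0 and the
    # left endpoint of the HSP is returned.
    return (max_index + lq, max_index + ls)


# Toy scoring system: +2 for identical non-X residues, -1 otherwise.
M = {(a, b): (2 if a == b != "X" else -1) for a in "ACX" for b in "ACX"}

q = "C" * 5 + "A" * 11 + "C" * 4
print(find_gapped_start(M, q, 0, 19, "A" * 20, 0, 19))  # (10, 10)
print(find_gapped_start(M, q, 2, 19, "X" * 20, 3, 19))  # (2, 3)
```

In the first call the best window covers the run of eleven matching residues at positions 5 to 15, so its midpoint, position 10, is returned. In the second call every window scores negatively, and the function falls back to the left endpoint of the HSP.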