Genomic sequence reveals a complex mitochondrial genome
Crypthecodinium cohnii
Previously reported C. cohnii cox1 sequences indicated multiple copies of the gene with different flanking sequences [15]. To test if this genomic complexity extends to other C. cohnii mitochondrial genes, we sequenced multiple genomic clones containing cob and/or cox3. A library of EcoRI restriction fragments constructed from a fraction enriched in mtDNA was screened using a C. cohnii cob gene probe, obtained by PCR. This screen recovered a cob clone linked to a 57-bp cox3 fragment, which itself was used to probe for cox3-containing clones. In total, 14 clones were characterized (11 cob, two cox3 and one containing both), ranging in size from 2.5 kb to 5.4 kb (eight clones were 3.7 kb long). End sequencing and restriction mapping identified six unique cob-containing clones, and three unique cox3-containing clones. Four clones were completely sequenced (Figure 1).
The largest clone, pc3#2.2 (5.4 kb), contains a complete or nearly complete cob gene (see below), followed by three other identifiable sequences: a 49-bp stretch identical to a sequence previously found in a cox1-containing clone [15]; a 113-bp cox3 segment; and a 99-bp large subunit (LSU) rRNA sequence corresponding to mitochondrial LSUG in apicomplexans [14]. Two additional cob clones were sequenced, pcb#7 (3.7 kb) and pcb#2 (3.2 kb). Both encode cob, but with flanking sequences different from those in pc3#2.2. pcb#2 contains unique 3' sequence immediately after the cob repeat, whereas pcb#7 shares a further ~1 kb of sequence with pc3#2.2 before unique sequence occurs (Figure 1). Amongst these clones, we observed two different 5'-flanking sequences and three different 3'-flanking sequences (Figure 1). This arrangement recapitulates the organization of cox1 in C. cohnii mtDNA [15], i.e., a central repeat (1072 bp, containing most of the cob ORF) flanked by different arrays of unique upstream and downstream sequences. Partial sequencing of the remaining clones revealed an additional unique 5'-flanking sequence (in pcb#8) and one additional unique 3'-flanking sequence (in pcb#4 and pcb#9) in the immediate vicinity of the cob ORF (data not shown).
Of the three cob-containing clones described above, only pcb#2 encodes a complete cytochrome b (Cob) protein (see below). pc3#2.2 and pcb#7 share an alternative 3' sequence that predicts a Cob C-terminal sequence lacking 24 amino acid residues compared with the pcb#2-predicted Cob as well as the corresponding Plasmodium falciparum Cob. This suggests that the pc3#2.2 and pcb#7 Cob ORFs represent pseudogenes. Variable 3' coding sequences were also seen previously for C. cohnii cox1, with some coding sequences also truncated compared to other dinoflagellate sequences [15].
One cox3-containing clone (pc3#5) was also sequenced, but it was found not to encode an intact cox3 gene. Instead, this clone encoded 1339 bp identical in sequence to the portion of pc3#2.2 that included the 113-bp cox3 segment and the 49-bp cox1 sequence (Figure 1). This clone was also flanked by unique sequences, providing further evidence that mitochondrial genes occur in multiple genomic contexts in C. cohnii.
To further investigate the arrangements and relative numbers of mtDNA elements, Southern hybridization analysis was performed using region-specific probes. As shown in Figure 1, probes were generated specific to: the cob coding sequence ('cob'); two cob 3'-flanking regions ('cb1', specific to pc3#2.2 and pcb#7; and 'cb3', specific to pcb#2); the cox3 sequence ('cox3'); and the rRNA sequence LSUG ('rnl'). These probes were hybridized against a mtDNA-enriched fraction hydrolyzed by EcoRI. With the 'cob' probe, a strong signal was detected at 3.7 kb and weaker signals at 4.8, 4.5, 3.5, and 3.0 kb (Figure 2). This result is consistent with dominant EcoRI clones being 3.7 kb, and with multiple genomic contexts for cob. Probing with 3' flanking sequence 'cb1' revealed a similar banding pattern to that generated by the 'cob' probe, indicating that this region is typically contiguous with the cob coding sequence. Probing with 'cb3' presented a very different profile, with 10 bands ranging in size from 3.7 to 0.5 kb and of varying intensity (Figure 2). The cb3 sequence evidently occurs in numerous EcoRI fragments, some without cob. Probing with 'cox3' and 'rnl' also revealed multiple bands with varying intensity (Figure 2), again indicating that these mtDNA elements are present in several different genomic arrangements. Together these Southern data verify the existence of multiple copies of C. cohnii mtDNA elements occurring in different contexts, and indicate that up to 10 different arrangements occur for some of these elements.
Karlodinium micrum
Putative mitochondrial genes were identified from a survey of 16544 K. micrum expressed sequence tag (EST) sequences assembled into 11903 unique clusters [22]. Oligoadenylation of mitochondrial gene transcripts is known from other organisms [23, 24], and this also appears to be the case in dinoflagellates, as the poly(A)-dependent K. micrum survey also contained many cDNAs for mitochondrial genes. Mitochondrial sequences were identified by homology to genes in other systems, and all such cDNAs were fully sequenced. Using this strategy we identified sequences representing the three protein-encoding genes found in C. cohnii: cox1 (1 cDNA), cob (11 cDNAs) and cox3 (9 cDNAs). The average A+T content of these sequences was 69% (compared to 49% for nuclear genes, calculated from all 11903 K. micrum clusters), consistent with their being encoded in the mitochondrion. We found no other mitochondrial protein-coding sequences exhibiting the strong A+T biases suggestive of an origin from mtDNA (cox2 coding sequence, for example, which is typically encoded in mitochondria but is known to have been transferred to the nucleus in dinoflagellates [25], contains 47% A+T). Several short cDNA sequences with high similarity to the fragmented apicomplexan mitochondrial rRNAs [14] (see also GenBank acc. no. M76611 for updated annotation) were, however, identified. These correspond to apicomplexan LSU rRNA fragments LSUA, RNA2, LSUE, LSUG and RNA10 (3, 1, 3, 1, and 9 cDNAs, respectively), small subunit (SSU) rRNA fragment RNA8 (9 cDNAs), and an RNA (RNA7, 7 cDNAs) that has yet to be assigned to either the LSU or SSU rRNA. While these sequences have a lesser A+T bias (56%) compared with the mitochondrial protein-encoding sequences, the high similarity of these sequences to their apicomplexan counterparts (see below), and the known oligoadenylation of these transcripts in apicomplexans [23, 24], strongly implicate these sequences as additional elements of the K. micrum mtDNA.
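The A+T comparison above comes down to a base count per sequence. The following is a minimal sketch of such a calculation, not the procedure used in this study; the function name `at_content` and the toy sequence are illustrative assumptions, and ambiguous bases are simply ignored.

```python
def at_content(seq: str) -> float:
    """Return the A+T (or A+U) fraction among unambiguous bases of `seq`."""
    seq = seq.upper()
    at = sum(seq.count(base) for base in "ATU")
    gc = sum(seq.count(base) for base in "GC")
    total = at + gc
    if total == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return at / total


# Toy sequence for illustration only (not data from the EST survey).
toy = "ATTTAGCATA"
print(f"A+T content: {at_content(toy):.0%}")
```

Averaging this value over a set of cDNAs or clusters would give figures comparable in kind to the percentages quoted above.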
With these 10 mtDNA tags, we used PCR to generate genomic sequences corresponding to each gene and regions linking them, with the aim of assembling large portions of K. micrum mtDNA sequence. Intergenic sequence recovered by this approach was used to provide further priming sites to extend the sampling of K. micrum mtDNA. In addition to amplification of individual genes, a total of 20 distinct gene linkage products were generated and fully sequenced (Figure 3B). This analysis yielded sequences in which mitochondrial genes were linked to one another in many different contexts. Gene fragments were also common, as were mtDNAs with three or four distinct fragments or tandem repeats (Figure 3B). In total, cob sequences were found in at least six mutually exclusive linkages, cox3 in five, cox1 in four, LSUE in nine, RNA10 in six, RNA2 in five and RNA7 in one. Additionally, two large cDNAs (GenBank accession EF443051, 5854 bp; and EF443052, 2153 bp) provided further evidence of multiple copies of mitochondrial genes and gene fragments linked in novel arrangements. EF443051, for example, contains the LSUG coding sequence, a second partial LSUG unit within a 170-bp repeat, the LSUA sequence, the RNA8 sequence, and an internal fragment of the cox1 gene (73 bp). These cDNAs also indicate that polycistronic transcription occurs in dinoflagellate mitochondria.
Intergenic sequences from the PCR clones were examined for additional coding elements by comparison to publicly available databases, specifically searching against K. micrum ESTs as well as comparing the intergenic regions to one another. No identifiable genes were found, but one cDNA sequence (GenBank accession EF443049) was represented in one mtDNA clone, implicating this sequence as an additional transcriptional unit of the mitochondrial genome (Figure 3B, xvi). Comparison of intergenic sequences to one another revealed numerous dispersed repeated sequences with either 100% or very high degrees of identity (Figure 3B, dashed lines). Overall, data from K. micrum are consistent with those from C. cohnii, both pointing to a complex genome organization evidently underpinned by a high level of recombination within dinoflagellate mitochondria.
Inverted repeats in mtDNA
Previous analysis of C. cohnii cox1 identified many short inverted repeats in flanking, non-coding sequences [15]. We have applied a similar analysis to the C. cohnii cob- and cox3-containing sequences, as well as the K. micrum mtDNA data, and find a very similar pattern of repeat features, although we also note some differences between the two taxa. Within the C. cohnii sequences, we screened for inverted repeats of different length and distance between them, and found two distinct but prevalent classes of this element type. The first class is similar to those previously described [15], and consists of very closely spaced, small inverted repeats (> 6 nucleotides and no more than 5 nucleotides apart). These inverted repeats occur almost exclusively within non-coding sequence, with the only exceptions being at the very extremities of genes (Figure 1, vertical dashes). A second class of inverted repeats consists of longer repeat elements (> 9 nucleotides) no more than 50 nucleotides apart. Such inverted repeats are also prevalent in C. cohnii mtDNA, and are almost exclusively features of the non-coding sequences (Figure 1, small circles).
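The two repeat classes described above can be expressed as a search for perfect inverted repeats with a minimum arm length and a maximum spacer. The sketch below is one way to run such a screen and is not the software used in this study. The function names are hypothetical, the thresholds `min_arm=7, max_gap=5` and `min_arm=10, max_gap=50` are our reading of "> 6 nucleotides, no more than 5 apart" and "> 9 nucleotides, no more than 50 apart", and the input is assumed to contain only A, C, G and T.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an unambiguous DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_inverted_repeats(seq: str, min_arm: int, max_gap: int):
    """Return (left_start, right_start, arm_len) for perfect inverted repeats.

    Only seeds of exactly `min_arm` nucleotides are reported; longer arms
    show up as several overlapping seeds and are not merged here.
    """
    seq = seq.upper()
    n = len(seq)
    hits = []
    for i in range(n - 2 * min_arm + 1):
        target = reverse_complement(seq[i:i + min_arm])
        for gap in range(max_gap + 1):
            j = i + min_arm + gap          # start of the candidate right arm
            if j + min_arm > n:
                break
            if seq[j:j + min_arm] == target:
                hits.append((i, j, min_arm))
    return hits


# Toy sequence: a 7-nt arm, a 3-nt spacer and the reverse-complement arm.
toy = "TTGACCTGCAAAGCAGGTCTT"
short_class = find_inverted_repeats(toy, min_arm=7, max_gap=5)
long_class = find_inverted_repeats(toy, min_arm=10, max_gap=50)
print(short_class, long_class)
```

Coordinates are 0-based. Comparing hit positions with gene boundaries would then show whether repeats fall within coding or non-coding sequence.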
Analysis of K. micrum mtDNA showed that inverted repeats are also a feature of intergenic sequences; however, in this case only the larger class of inverted repeats was found, with none of the smaller, closely spaced inverted repeats occurring in any of the mtDNA sequences (Figure 3). Again these repeats are almost exclusively located within intergenic regions, with genic inverted repeats only occasionally present, within gene extremities. No equivalent inverted repeats were found in a random sample of 10 K. micrum nucleus-encoded gene sequences (10630 nucleotides total). The sequences of repeated elements in both C. cohnii and K. micrum are consistent with secondary structures such as stem loops and hairpins, and in both cases the repeated elements that could form such stem structures are typically G+C rich, in spite of the A+T bias of these organelle genomes. The inverted repeats described here are also distinct from secondary structural elements of the rRNAs (see below) that typically consist of imperfect inverted repeats. Densely packed inverted repeats, primarily in intergenic regions, were also recently described from A. carterae mtDNA [20]. In this case, imperfect inverted repeats were predicted to form stems of 50–150 nucleotides, with AT-rich loops of ~10–30 nucleotides. While inverted repeats therefore appear to be a consistent feature of dinoflagellate mitochondrial genomes, the elaboration of these elements is variable between taxa, with shorter repeats only present in C. cohnii.
Mitochondrial gene transcripts lack stop and start codons
Extensive substitutional RNA editing of transcripts occurs in dinoflagellate mitochondria, so exactly where an open reading frame begins and ends can only be tentatively inferred from genomic DNA. Accordingly we used K. micrum cDNAs, and publicly available mRNA sequences from several other dinoflagellates, to identify the ends of all three protein-coding genes.
Absence of stop codons
Oligoadenylation of transcripts apparently occurs upstream of any canonical stop codon in all protein-encoding transcripts analyzed, and for only one gene does oligoadenylation create an in-frame canonical stop codon. This lack of encoded stop codons applies to transcripts for cob, cox3 and cox1 represented from multiple species. All 11 cob transcripts from K. micrum are oligoadenylated at the same point, which corresponds to the expected C-terminus of Cob homologues (Figure 4), but does not include an in-frame stop. The 3' ends of transcripts from four other dinoflagellates (P. piscicida, Prorocentrum minimum, G. polyedra, A. carterae) are oligoadenylated at precisely the same position (Figure 4). For cox1, the mRNA sequences from four taxa (P. minimum, P. piscicida, A. carterae, and Karenia brevis) are all oligoadenylated at the same position, where the protein sequence is predicted to terminate (Figure 4); once again, none of these encode a stop codon.
The K. micrum cox3 cDNAs present an even more interesting situation. Five of nine cDNAs are oligoadenylated approximately 40 codons upstream of the predicted C-terminus, and without an in-frame stop codon (Figure 4). However, another four cDNAs are oligoadenylated a further 129 nucleotides downstream; these cDNAs encode amino acid sequence with high similarity to the C-terminus of Cox3. In this case, oligoadenylation follows a U residue creating an in-frame UAA stop codon. The generation of an in-frame stop codon concomitant with oligoadenylation is also apparent in Amphidinium cox3 mRNA; however, as in K. micrum, other cox3 Amphidinium transcripts are oligoadenylated prematurely, within a few bases of the premature oligoadenylation site in K. micrum cDNAs (Figure 4). Alternative oligoadenylation sites have also been reported for cox3 transcripts in the dinoflagellate G. polyedra [16].
A potential alternative stop codon was sought among these transcript data by looking for a codon that occurs exclusively in the 3' region of these coding sequences. However, no such candidate codon could be identified either within or between the taxa surveyed, nor is there any evidence for use of a non-standard genetic code (with the possible exception of start codons, see below). Moreover, oligoadenylation consistently occurred at the position where the protein sequence is expected to terminate, leaving little or no apparent untranslated region (UTR).
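The search described above, for a codon found only at the 3' end of coding sequences, can be sketched as a set difference between terminal and internal codons. This is an illustrative reconstruction under our own assumptions, not the analysis actually performed: the function names and the `tail` parameter (how many final codons count as the 3' region) are hypothetical, and the input sequences are assumed to be in frame.

```python
def codons(cds: str) -> list[str]:
    """Split an in-frame coding sequence into complete codons."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3   # ignore a trailing partial codon
    return [cds[i:i + 3] for i in range(0, usable, 3)]


def terminal_only_codons(cds_list, tail: int = 1) -> set[str]:
    """Codons seen in the last `tail` codons of some sequence but never elsewhere."""
    tail_set, body_set = set(), set()
    for cds in cds_list:
        c = codons(cds)
        body_set.update(c[:-tail])
        tail_set.update(c[-tail:])
    return tail_set - body_set


# Toy sequences for illustration only.
print(terminal_only_codons(["TTTGCTCCC", "TTTGCTAAACCC"]))   # CCC is terminal-only
print(terminal_only_codons(["TTTGCTGCT"]))                   # GCT also occurs internally
```

An empty result across all genes and taxa corresponds to the outcome reported here, in which no candidate alternative stop codon was identified.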
Alternative start codons
Dependence on a standard ATG start codon is apparently also relaxed in dinoflagellate mitochondria. In multiple dinoflagellate species, mRNAs for the three protein-coding genes extend beyond conserved N-termini, suggesting that these transcripts are likely to be full length, but all lack a plausible N-terminal AUG (Figure 4). Existing genomic sequences corroborate the lack of initiating ATGs.
Transcript data for cox3 from three species (K. brevis, K. micrum and G. polyedra) and cox1 from K. micrum are all apparently full length based on protein alignments and all lack an AUG in the terminal region (Figure 4). The corresponding genomic region upstream of K. micrum cox1 does not contain an in-frame ATG until 615 nucleotides upstream of the conserved sequence, and 11 stop codons fall between them, supporting the likely absence of an ATG from this gene. Genomic sequences for C. cohnii cox1, however, do contain an in-frame ATG ~13 codons upstream of N-terminal sequence conservation seen among dinoflagellates. While it is possible that this particular ATG serves as the initiator codon in this taxon, the lack of any sequence conservation with the corresponding K. micrum sequence within this 13-residue stretch (Figure 4) suggests that this might also represent a chance ATG within the 5' UTR.
K. micrum cob mRNAs do encode an AUG close to the site where sequence conservation with other Cob proteins begins, but on close inspection there is conserved sequence upstream of this codon (Figure 4). Further, cob from the early-diverging member of the dinoflagellates, Oxyrrhis marina, lacks this AUG or any other upstream of this region [21]. In mRNAs of all other available species (K. micrum, K. brevis, and P. piscicida) there is strong conservation of the four predicted amino acid residues upstream of this ATG (F, V/L, L, L), further suggesting that translation likely initiates upstream of it (Figure 4). The conservative change of this second residue, V to L, among dinoflagellate taxa (and V to I in the genomic sequence for C. cohnii) supports the inference that this region likely represents protein-coding sequence rather than UTR. Some conservation of this sequence with Plasmodium Cob is also apparent (Figure 4). None of the four apparently full-length K. micrum cob genomic sequences encodes an additional ATG codon between this region of conservation and the next in-frame stop codon (Figure 4), and the same situation is seen in a P. piscicida cob sequence. The C. cohnii genomic sequences are the only cases to date where potential ATG codons do occur in this upstream sequence (Figure 4). However, two of these occur well upstream of any 5'-sequence conservation among dinoflagellates, and would represent unusually long (5'-extended) and divergent Cob proteins in these cases (Figure 4).
Trans-splicing of cox3
Included among the K. micrum cox3 cDNAs were four inferred to be full length (839 nucleotides) based on protein alignments (Figure 4), and five inferred to be prematurely oligoadenylated at nucleotide 712. Despite the fact that the longer cDNA is likely the functional cox3 mRNA, a genomic copy corresponding to it could not be amplified from genomic DNA using multiple primer combinations (all of which successfully amplified the corresponding fragments in RT-PCRs; data not shown). The longest product obtained from genomic DNA corresponded to nucleotides 50–712 of the full-length cox3 sequence. Six genomic fragments containing cox3 sequence were obtained by amplifying between genes, and these suggest that the gene is fragmented in the genome (Figure 3B, xv, xvi, xvii, xviii, xix and xx). Notably, three unique cox3 genomic sequences are truncated at nucleotide 712, precisely where the short cox3 transcripts are oligoadenylated (Figure 3B, xv, xvi and xx). Immediately downstream is a stop codon, and subsequently no further sequence similarity to cox3. Similarly, the only genomic sequences found to encode the 3' end of the long transcript are 5'-truncated at nucleotide 718, with sequence unrelated to cox3 upstream of this point (Figure 3B, xvii and xviii). Taken together, these data suggest that the long cox3 transcript is the product of trans-splicing, where nucleotides 1–712 are joined to nucleotides 718–839 arising from two different genomic fragments. The intervening five nucleotides (713–717) are all A residues in the full-length cox3 transcript, suggesting that trans-splicing occurs within the oligo(A) tail of the upstream transcript.
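The coordinate arithmetic behind this model can be checked with a short sketch. It joins a 712-nucleotide upstream product to the 122-nucleotide downstream product (nucleotides 718–839) through five retained A residues. The function name `trans_splice` is hypothetical, and the sequences are placeholders of the correct length only ("N" standing for any base), so this illustrates the proposed junction rather than the mechanism itself.

```python
def trans_splice(upstream: str, downstream: str, retained_a: int = 5) -> str:
    """Join two precursor RNAs through `retained_a` residues of the upstream oligo(A) tail."""
    return upstream + "A" * retained_a + downstream


# Placeholder sequences: lengths follow the coordinates given in the text.
upstream = "N" * 712                    # nucleotides 1-712
downstream = "N" * (839 - 718 + 1)      # nucleotides 718-839 (122 nt)

mature = trans_splice(upstream, downstream)
print(len(mature))                      # length of the full cox3 transcript
print(mature[712:717])                  # 0-based slice covering nucleotides 713-717
```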
Mitochondrial rRNAs are fragmented in a similar pattern as in apicomplexans
SSU and LSU rRNAs are encoded in all characterized mtDNAs; however, until recently [17] no mitochondrial rRNA sequences had been described from dinoflagellates. In this study we have identified several discrete, short sequences with strong similarity to components of the highly fragmented rRNAs of apicomplexans [14] (GenBank acc. no. M76611). From K. micrum, we obtained cDNA sequences representing five LSU rRNA fragments (LSUA, RNA2, LSUE, LSUG, and RNA10), one SSU rRNA fragment (RNA8), and one unassigned rRNA fragment (RNA7), all of which correspond to known transcriptional units of the Plasmodium mitochondrial genome. We also identified an additional LSU rRNA fragment, LSUF, as well as LSUE and LSUG, from an EST survey we previously conducted in Heterocapsa triquetra [26]. Alignment of LSUA, LSUE, LSUF, LSUG and RNA10 to their Plasmodium LSU homologues is shown in Figure 5. SSU rRNA fragment RNA8 and unassigned fragment RNA7 share 66% and 74% sequence identity to Plasmodium homologues, respectively. For each fragment, multiple cDNAs were sequenced (with the exception of RNA2 and LSUG), and oligoadenylation was found to occur at a consistent site (Figure 5). Although these cDNAs are all relatively short, the 5' ends could not be definitively determined from these cDNAs because the 5'-lengths were variable. Further, genomic copies (where they are known) encoded conserved sequence upstream of the 5' ends of cDNAs of LSUE and LSUG (Figure 5).
For C. cohnii, the LSUG sequence identified on EcoRI clone pc3#2.2 was analyzed by 3' RACE and the site of oligoadenylation was shown to be identical to that in the corresponding K. micrum and H. triquetra cDNAs (Figure 5). Northern analysis of C. cohnii RNA showed a single LSUG-positive band at ~108 nucleotides [27]. This size corresponds well with the limit of conservation among LSU rRNA sequences, as well as the size of the Plasmodium LSUG. C. cohnii LSUE was also amplified and the ends determined by 5'-cDNA sequencing and 3' RACE (Figure 5). Northern hybridization against mitochondrial RNA confirmed the presence of an ~200 nucleotide RNA species [27].
The oligoadenylation sites for mitochondrial rRNA fragments are identical among dinoflagellates, and either identical or within a few nucleotides of those observed in Plasmodium (Figure 5). The 5' ends of these sequences, whether defined experimentally (LSUE and LSUG from C. cohnii) or by sequence conservation, are also very similar to those of their Plasmodium counterparts. The only possible exception is K. micrum RNA2, where the sole cDNA obtained contained substantial upstream (305 nucleotides) and downstream (79 nucleotides) sequence compared to the region with similarity to Plasmodium RNA2. However it is possible that this cDNA represents an unprocessed precursor, and accordingly further work is required to substantiate the size of this putative rRNA fragment. Secondary structure predictions for dinoflagellate sequences LSUA, LSUE, LSUF, LSUG, RNA10 and putative RNA2 (limited to the region of similarity to the Plasmodium RNA2) all indicate that the expected folding and intermolecular base pairings occur (Figure 6), and these fragments are likely to contribute to a viable reconstituted LSU rRNA, as for Plasmodium.
RNA editing
Protein-coding genes
RNA editing has been described for cox1, cob and cox3 transcripts from diverse dinoflagellates, including the cob mRNA of K. micrum [18–20]. Comparison of K. micrum cDNA and corresponding mtDNA sequences for the three genes identified here confirms this conclusion for transcripts of cob, and further shows that cox1 and cox3 transcripts are also edited. The average density of editing of the cox1 transcripts is one substitution per 36 nucleotides and this value is consistent with other studies in different species [18, 19]. By contrast, editing in cox3 transcripts is over twice as dense, at one substitution per 17 nucleotides, making cox3 the most heavily edited gene transcript in dinoflagellates. Editing of cob mRNA lies in between these extremes, at one substitution per 25 nucleotides.
In the case of cox1 transcripts, four types of substitutional changes were detected at 42 sites. Of these, 48% were A to G substitutions, followed by U to C (21%) and smaller proportions of C to U and G to C edits (17% and 14%, respectively). This observation is consistent with cox1 mRNA editing occurring in other species, where most (80%) of the reported changes are A to G and U to C substitutions [18, 19]. So far, G to C changes have only been observed in mtDNA-encoded mRNAs of dinoflagellates, whereas A to G changes have only been reported in nucleus-encoded mRNAs. cox3 mRNA editing types are generally consistent with those observed in cox1 and cob mRNAs. Five types of substitutional changes were observed at 50 sites, of which 42% were A to G changes, followed by C to U and U to C edits (28% and 22% respectively), as well as three G to A edits (6%) and a single G to C edit (2%). For both cox1 and cox3 mRNAs, the majority of substitutions occur at the first or second positions of affected codons (88% and 96%, respectively), and over 90% of editing events result in a change in predicted amino acid. In K. micrum cox3 mRNA (and cox1 and cob mRNAs of other dinoflagellates [18, 19]), editing also removes a UAG codon, which is typically a stop codon but is apparently unassigned in dinoflagellates.
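Tallies of this kind (substitution type and codon position) follow from a position-by-position comparison of a genomic sequence with its cDNA. The sketch below assumes the two sequences are already aligned without gaps and are of equal length. The function name `tally_edits` and the `frame_offset` parameter are our own, and the toy sequences are not data from this study.

```python
from collections import Counter


def tally_edits(genomic: str, cdna: str, frame_offset: int = 0):
    """Count substitution types and affected codon positions (1, 2 or 3).

    `frame_offset` is the 0-based index of the first base of a codon.
    Differences are written genomic>cDNA in RNA letters, e.g. "A>G".
    """
    if len(genomic) != len(cdna):
        raise ValueError("sequences must be aligned and of equal length")
    g_seq = genomic.upper().replace("T", "U")
    c_seq = cdna.upper().replace("T", "U")
    types, codon_pos = Counter(), Counter()
    for i, (g, c) in enumerate(zip(g_seq, c_seq)):
        if g != c:
            types[f"{g}>{c}"] += 1
            codon_pos[(i - frame_offset) % 3 + 1] += 1
    return types, codon_pos


# Toy example: three differences over nine nucleotides.
types, codon_pos = tally_edits("ATGACATTT", "GTGACGTCT")
n_edits = sum(types.values())
print(dict(types), dict(codon_pos), 9 / n_edits)   # last value: nucleotides per substitution
```

Dividing sequence length by the number of differences gives the editing density in the form used above (one substitution per N nucleotides).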
Analysis of the 20 cDNAs corresponding to cox3 and cob offers further insight into the process of RNA editing in dinoflagellates. Despite overall uniformity of transcript editing, some cDNAs exhibit pre-edited states. K. micrum cox3 and cob contain 50 and 44 editing sites, respectively, with the cDNAs analyzed here representing in total 343 and 231 potential editing events, respectively. However, at nine of these sites in the cox3 cDNAs, and five in the cob cDNAs, the pre-edited nucleotide occurs, indicating 2.6% and 2.2% 'non-edits', respectively. These 'non-edits' were present in only a few cDNAs (two and three for cob and cox3, respectively), suggesting that the great majority of cDNAs represent mature transcripts. The pre-edited sites are scattered throughout the transcripts where they are found, occur between other edited sites, and in no obvious order in any sequence. These pre-edited sites may indicate editing failures, in which case such transcripts could give rise to defective translation products. Alternatively, they may represent editing intermediates. If the latter is the case, these data suggest that editing does not occur in a linear sequence along each transcript. Pre-edited mitochondrial cDNAs have also recently been reported from A. carterae [20].
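The 'non-edit' proportions above are counts of pre-edited nucleotides over all editing sites covered by the cDNAs. A minimal sketch of that count follows, assuming each cDNA is gap-free and comes with the 0-based gene coordinate at which it starts. The function name, argument layout and toy data are hypothetical.

```python
def count_non_edits(genomic: str, edit_sites, cdnas):
    """Return (non_edits, potential_events).

    `edit_sites` are 0-based gene coordinates known to be edited;
    `cdnas` is a list of (start, sequence) pairs, so partial cDNAs only
    contribute the sites they actually cover.
    """
    g_seq = genomic.upper().replace("T", "U")
    non_edits = potential = 0
    for start, seq in cdnas:
        c_seq = seq.upper().replace("T", "U")
        for site in edit_sites:
            k = site - start
            if 0 <= k < len(c_seq):
                potential += 1
                if c_seq[k] == g_seq[site]:   # cDNA still carries the genomic base
                    non_edits += 1
    return non_edits, potential


# Toy data: two C-to-U sites, three cDNAs (one partial, one with a non-edit).
result = count_non_edits(
    "ACGTACGT",
    edit_sites=[1, 5],
    cdnas=[(0, "ATGTATGT"), (0, "ATGTACGT"), (4, "ATGT")],
)
print(result)

# The proportions quoted in the text follow from the reported counts.
print(round(100 * 9 / 343, 1), round(100 * 5 / 231, 1))
```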
rRNA transcripts
Comparisons of rRNA cDNAs to genomic sequences are constrained by the smaller sizes of these sequences (for example, 63 nucleotides for RNA7). In particular, where PCR was used to amplify genomic sequence, a greater portion of the sequence represents primer binding sites and therefore cannot be used in such a comparison. Nevertheless, from the available data, there is no evidence of editing of RNA8, RNA10 or RNA7. For LSUE, complete genomic sequence (170 nucleotides) was available from the internal regions of five PCR fragments, with the majority of the sequence available from a further four PCR products using LSUE primers. These sequences were identical to the cDNAs except for three consecutive nucleotides that were absent in two of the three LSUE cDNAs obtained from the EST survey. To test this anomaly, a further five cDNAs were independently generated, and these all contained the three nucleotides, and therefore were identical to genomic LSUE sequences and to one of the original EST sequences. These results suggest that the three-nucleotide deletions seen in two cDNAs represent a rare artifact, likely generated during reverse transcription, and that K. micrum LSUE is likely also not edited.
There was, however, evidence of substitutional editing for LSUA and LSUG. In both cases genomic copies of these sequences differed from transcripts: in LSUG at eight positions and in LSUA at six positions (Figures 5 and 6). Consistent with the protein-coding genes, these substitutions consist mainly of A to G (36%), C to U (43%) and U to C (14%) substitutions, with one case of C to G. Given that dinoflagellate mitochondrial genes occur in multiple copies, recovery of further, independently isolated copies of these genes will be required to substantiate these inferences of rRNA editing. Evidence for rRNA editing has also recently been reported in the dinoflagellate A. catenella, where two inferred editing events were identified for the 'LSUE-like' rRNA [17].