Modified base-binding EVE and DCD domains: striking diversity of genomic contexts in prokaryotes and predicted involvement in a variety of cellular processes

Bell, Ryan T.; Wolf, Yuri I.; Koonin, Eugene V.

doi:10.1186/s12915-020-00885-2

Research article
Open access
Published: 04 November 2020

BMC Biology volume 18, Article number: 159 (2020)

Abstract

Background

DNA and RNA of all cellular life forms and many viruses contain an expansive repertoire of modified bases. The modified bases play diverse biological roles that include both regulation of transcription and translation, and protection against restriction endonucleases and antibiotics. Modified bases are often recognized by dedicated protein domains. However, the elaborate networks of interactions and processes mediated by modified bases are far from being completely understood.

Results

We present a comprehensive census and classification of EVE domains that belong to the PUA/ASCH domain superfamily and bind various modified bases in DNA and RNA. We employ the “guilt by association” approach to make functional inferences from comparative analysis of bacterial and archaeal genomes, based on the distribution and associations of EVE domains in (predicted) operons and functional networks of genes. Prokaryotes encode two classes of EVE domain proteins, slow-evolving and fast-evolving ones. Slow-evolving EVE domains in α-proteobacteria are embedded in conserved operons, potentially involved in coupling between translation and respiration, cytochrome c biogenesis in particular, via binding 5-methylcytosine in tRNAs. In β- and γ-proteobacteria, the conserved associations implicate the EVE domains in the coordination of cell division, biofilm formation, and global transcriptional regulation by non-coding 6S small RNAs, which are potentially modified and bound by the EVE domains. In eukaryotes, the EVE domain-containing THYN1-like proteins have been reported to inhibit programmed cell death (PCD) and regulate the cell cycle, potentially, via binding 5-methylcytosine and its derivatives in DNA and/or RNA. We hypothesize that the link between PCD and cytochrome c was inherited from the α-proteobacterial and proto-mitochondrial endosymbiont and, unexpectedly, could involve modified base recognition by EVE domains. Fast-evolving EVE domains are typically embedded in defense contexts, including toxin-antitoxin modules and type IV restriction systems, suggesting roles in the recognition of modified bases in invading DNA molecules and targeting them for restriction. We additionally identified EVE-like prokaryotic Development and Cell Death (DCD) domains that are also implicated in defense functions including PCD. This function was inherited by eukaryotes, but in animals, the DCD proteins apparently were displaced by the extended Tudor family proteins, whose partnership with Piwi-related Argonautes became the centerpiece of the Piwi-interacting RNA (piRNA) system.

Conclusions

Recognition of modified bases in DNA and RNA by EVE-like domains appears to be an important, but until now, under-appreciated, common denominator in a variety of processes including PCD, cell cycle control, antivirus immunity, stress response, and germline development in animals.

Background

DNA and different types of RNA of all organisms and diverse viruses contain a variety of modified bases. These derivatives of the canonical purines and pyrimidines perform a broad range of biological functions including regulation of transcription and translation as well as self- versus non-self-discrimination that is required for protection against biological defense and offense systems, such as restriction endonucleases and antibiotics [1,2,3,4,5,6]. The intricate networks of interaction and complex processes mediated by modified bases are far from being completely understood.

Modified bases are often recognized by dedicated protein domains. One such domain, widespread in eukaryotes and prokaryotes, is known as EVE (named for Protein Data Bank (PDB) structural identifier 2eve) [7]. Sequence and structure analyses have shown that the EVE domain is a member of the PUA (pseudouridine synthase and archaeosine transglycosylase)/ASCH (ASC-1 homology) superfamily, a widely disseminated and apparently ancient assemblage of nucleic acid-binding domains [8,9,10,11,12,13]. These domains are generally associated with the translation apparatus, often fused to RNA modification enzymes, and bind RNA themselves [11,12,13,14]. Some ASCH domains have also been predicted to bind modified bases [15].

The first EVE domain to be characterized is found in mammalian thymocyte nuclear protein 1 (THYN1/Thy28), in which it comprises the highly conserved C-terminal region [7]. THYN1/Thy28 was identified as a reader of 5-methylcytosine (5mC) and 5-hydroxymethylcytosine (5hmC), as well as further oxidized 5mC derivatives 5-formylcytosine (5fC) and 5-carboxylcytosine (5caC), in DNA [17]. Most eukaryotes encode orthologs of Thy28/THYN1 in which EVE is the only recognized domain, although fusions with AT-hook and other domains in fungi have been described, further supporting the role of EVE as a DNA-binding domain in these proteins [15]. The PUA-like SRA (SET and RING-associated) domain also binds 5mC and 5hmC DNA [17, 18]. However, a different PUA-like domain, YTH (YT521-B homology), shows the closest structural similarity to EVE [10]. The YTH domain also binds modified bases, recognizing N⁶-methyladenosine (m⁶A) in RNA, in the case of eukaryotic proteins, and m⁶A DNA, in the case of archaeal proteins [19, 20]. The conserved core of the PUA/ASCH superfamily consists of a 5-stranded β-barrel (Fig. 1), often with an α-helix between strands 1 and 2, a structural element that is present in EVE domains, which also contain an additional sixth strand in the β-barrel [10] (Fig. 1).

THYN1/Thy28 was originally identified as one of about 300 previously uncharacterized genes that are preferentially expressed in human CD34+ hematopoietic stem/progenitor cells [21]. Shortly afterwards, a cDNA was isolated from apoptotic avian thymocytes encoding a 242 amino acid protein with 88% amino acid similarity to THYN1/Thy28 [22]. Initial cloning and characterization of murine THYN1/Thy28 established nuclear localization and found protein levels to be the highest in testis, with thymus, spleen, liver, and kidney also displaying substantial expression [23]. In a more recent study, nuclear THYN1/Thy28 has been detected in nearly all human tissues [24].

Several studies have explored the role of THYN1/Thy28 in lymphocyte model systems where programmed cell death (PCD), also known as apoptosis, can be induced by antibody treatment. Decreased THYN1/Thy28 protein expression was observed following induction, suggesting that downregulation of this gene is associated with apoptosis initiation [23]. Conversely, overexpression of THYN1/Thy28 was correlated with inhibition of several apoptotic events, such as loss of mitochondrial membrane potential and caspase-3 activation [25]. Furthermore, these experiments have demonstrated accumulation of cells in G1 phase following THYN1/Thy28 overexpression, suggesting that this protein is involved in the regulation of cell cycle progression.

We were interested in the apparently diverse but poorly characterized functions of the EVE domains, and in particular, in the potential roles of modified base recognition in various biological processes. Here, we report a comprehensive bioinformatic analysis of the broad phyletic distribution of EVE-like domains, with an emphasis on the radiation among Proteobacteria, intriguing associations with base modification-dependent restriction and toxin-antitoxin systems, and the identification of the Development and Cell Death (DCD) domain as a member of the EVE-like superfamily. We apply the “guilt by association” approach [26,27,28,29,30] to make functional inferences from an extensive comparative analysis of the expanded collection of bacterial and archaeal genomes.

Results

A census of EVE proteins

Our search for EVE proteins using PSI-BLAST and HHpred seeded with profiles derived from multiple alignments of the amino acid sequences of known EVE domains (see “Materials and methods” for details) showed that the EVE domain is most prevalent among Proteobacteria, which harbor the majority of all prokaryotic EVE proteins detected (Additional file 1: Fig. S1) and a plurality of all EVE proteins. CLANS analysis [31] of EVE domains extracted from all EVE proteins in the dataset revealed a diverse cloud of sequences, with four well-defined clusters (Fig. 2). The largest cluster (blue in Fig. 2) consists, mostly, of sequences from β- and γ-proteobacteria, as well as those from the metazoa and fungi. The second largest cluster (red) includes mostly sequences from α-proteobacteria and Bacteroidetes, as well as the majority of plant sequences. Two smaller, almost completely prokaryotic clusters were also identified. The first (green) represents a collection of sequences largely from Proteobacteria, Actinobacteria, and Bacteroidetes. These EVE domains are usually encoded in operonic contexts which imply a role in ligand-activated transcriptional regulation. The second (purple) is mostly made up of sequences from γ-proteobacteria, Firmicutes, and Bacteroidetes and is unique in that the EVE domains in this group are almost always fused to a GNAT-like (GCN5-related N-acetyltransferase) domain.

We chose to focus our initial analysis on the two large clusters which consist, mostly, of proteobacterial EVE domains. α-proteobacteria were the most abundant class in the data, from which the majority of sequences in the second largest cluster (red in Fig. 2) derive.

EVE in α-proteobacteria

The EVE proteins of this class (Fig. 3) are frequently located in a putative operon with the tRNA N⁶-adenosine threonylcarbamoyltransferase TsaD, glycerol-3-phosphate dehydrogenase GpsA, and YciI, a small ferredoxin-fold protein homologous to muconolactone isomerases [32]. The sequences of the EVE domains in this group are readily recognizable (RPS-BLAST E-values of ~ 1e−42 or better with the pfam01878 query) and form a tight, well-conserved collection with within-group divergence comprising only 35% of the overall divergence between EVE domains (see “Materials and methods” for details).
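The within-group divergence figure quoted above can be illustrated with a minimal sketch. The exact procedure is described in “Materials and methods” and is not reproduced here, so the sketch assumes a simple definition: the mean pairwise distance within a group divided by the mean pairwise distance over all sequences. The function name, the input format, and the toy distances are hypothetical.

```python
from itertools import combinations
from statistics import mean

def divergence_ratio(dist, group):
    """Mean pairwise distance within `group` divided by the mean over all sequences.

    dist  : dict mapping frozenset({a, b}) -> distance (hypothetical input format)
    group : set of sequence identifiers forming the cluster of interest
    """
    all_ids = sorted({seq_id for pair in dist for seq_id in pair})
    overall = mean(dist[frozenset(p)] for p in combinations(all_ids, 2))
    within = mean(dist[frozenset(p)] for p in combinations(sorted(group), 2))
    return within / overall

# Toy distances (invented for illustration): three tightly related domains and one outlier.
toy = {
    frozenset({"a1", "a2"}): 0.2,
    frozenset({"a1", "a3"}): 0.3,
    frozenset({"a2", "a3"}): 0.1,
    frozenset({"a1", "x"}): 1.0,
    frozenset({"a2", "x"}): 1.2,
    frozenset({"a3", "x"}): 0.8,
}
ratio = divergence_ratio(toy, {"a1", "a2", "a3"})
```

Under this assumed definition, a small ratio indicates that the group is much more homogeneous than the EVE domains as a whole, which is the sense in which the α-proteobacterial group is described as tight and well conserved.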

This highly conserved directional unit (TsaD→GpsA→YciI→EVE) is itself strongly associated with another predicted operon which encodes 3 enzymes of heme biosynthesis, namely, porphobilinogen deaminase (HemC), uroporphyrinogen-III synthase (HemD), and coproporphyrinogen oxidase (HemY/HemG), as well as a diverged homolog of HemX, a putative uroporphyrinogen-III C-methyltransferase that is also homologous to IMMP (inner membrane mitochondrial protein, also known as mitofilin) [33]. In Rhodobacteraceae, HemC is missing from this generally well-conserved gene order. The head-to-head orientation of these putative operons suggests that the promoter regions might overlap, allowing for co-regulation.
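A head-to-head (divergent) arrangement can be recognized from gene coordinates alone. The following is a minimal sketch, assuming each gene is described by a name, start, end, and strand; the `Gene` record, the function name, and the coordinates are hypothetical and are not taken from any genome analyzed here.

```python
from typing import NamedTuple

class Gene(NamedTuple):
    name: str
    start: int   # 1-based start coordinate on the forward strand
    end: int     # 1-based inclusive end coordinate
    strand: str  # "+" or "-"

def is_head_to_head(g1: Gene, g2: Gene) -> bool:
    """True if two non-overlapping neighbors are transcribed away from each other."""
    left, right = sorted((g1, g2), key=lambda g: g.start)
    return left.strand == "-" and right.strand == "+" and left.end < right.start

# Invented coordinates: the first gene of each putative operon, facing apart.
hemC = Gene("hemC", 1000, 1900, "-")
tsaD = Gene("tsaD", 2050, 3100, "+")
divergent = is_head_to_head(hemC, tsaD)
```

In such an arrangement the two upstream regions share the same intergenic segment, which is why overlapping promoters and co-regulation are plausible.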

The association between the EVE domain and cytochrome c biosynthesis via regulation of heme production in α-proteobacteria is further emphasized by the presence of a cytochrome c biosynthetic cluster (CcmC through CcmI) adjacent to the EVE domain that is conserved in both the Acetobacteraceal branch of Rhodospirillales and Sphingomonadales (Fig. 3) [34]. In Sphingomonadales, a likely operon including the tRNA-modifying enzyme MiaB, which adds a methylthio group to N⁶-isopentenyladenosine at position 37 in many tRNAs decoding UNN (the same position modified by TsaD), often occurs between the EVE domain and the cytochrome c biosynthetic operon [35].

A contextual information network graph generated from the pairwise domain associations in prokaryotic EVE protein genomic neighborhoods showed that in α-, β-, and γ-proteobacteria, respectively, the EVE proteins are associated with highly conserved, but largely non-overlapping gene complements (Fig. 4).
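The idea behind a network built from pairwise domain associations can be sketched as follows. This is an illustrative approximation only: it assumes each genomic neighborhood is given as a list of domain names and that an edge is kept when a pair co-occurs in at least `min_count` neighborhoods. The function name, the threshold, and the toy neighborhoods are hypothetical, and the actual weighting scheme used for Fig. 4 may differ.

```python
from collections import Counter
from itertools import combinations

def association_edges(neighborhoods, min_count=2):
    """Count how often each unordered pair of domains shares a neighborhood."""
    counts = Counter()
    for domains in neighborhoods:
        # Deduplicate within a neighborhood so each pair is counted once per locus.
        for pair in combinations(sorted(set(domains)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}

# Toy neighborhoods loosely modeled on the gene arrays named in the text.
toy_neighborhoods = [
    ["TsaD", "GpsA", "YciI", "EVE"],
    ["TsaD", "GpsA", "YciI", "EVE", "HemC"],
    ["ZapB", "ZapA", "EVE"],
    ["ZapB", "ZapA", "EVE", "TauE"],
]
edges = association_edges(toy_neighborhoods)
```

In this toy input, the TsaD/GpsA/YciI and ZapA/ZapB groups are each linked to EVE but share no edges with one another, mirroring the largely non-overlapping gene complements described above.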

EVE in β- and γ-proteobacteria

A prominent exception to the general lack of overlap between the contextual information networks among Proteobacteria is the conservation between β- and γ-proteobacteria of an apparent operonic linkage of EVE proteins and the cell division proteins ZapA and ZapB (Fig. 4). The sequences of the EVE domains in these proteins are also highly recognizable, slightly more so, in fact, than those in α-proteobacteria (RPS-BLAST E-values of ~ 5e−55 or better). They likewise form a tight, well-conserved group, with their within-group divergence accounting for only 32% of the overall divergence between EVE domains. The protein-coding gene array ZapB→ZapA→EVE also contains, between ZapA and EVE, a non-coding 6S RNA (ssrS) gene. Our analysis of these neighborhoods suggests that the ssrS gene is (nearly) always present, because the positions of the protein-coding genes leave a gap sufficient to accommodate the 6S RNA, but it is not consistently annotated, conceivably, due to sequence divergence. For this reason, ssrS was not included in our calculations that produced the contextual information network graph (Fig. 4).
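The gap-based reasoning about unannotated ssrS genes can be expressed as a simple check. The sketch below assumes 1-based inclusive coordinates and a minimum gap of 180 nt, chosen because the E. coli 6S RNA is about 184 nt long; the threshold, the function name, and the coordinates are hypothetical and are not the criteria applied in the analysis.

```python
# Assumed threshold: slightly below the ~184 nt length of the E. coli 6S RNA.
SIX_S_RNA_MIN_LENGTH = 180

def gap_could_hold_6s(zapa_end, eve_start, min_length=SIX_S_RNA_MIN_LENGTH):
    """True if the intergenic gap between zapA and the EVE gene could fit a 6S RNA.

    Coordinates are 1-based and inclusive, so the gap excludes both boundary bases.
    """
    gap = eve_start - zapa_end - 1
    return gap >= min_length

# Invented coordinates for two loci lacking an annotated ssrS gene.
wide_gap = gap_could_hold_6s(zapa_end=1000, eve_start=1250)   # 249 nt gap
narrow_gap = gap_could_hold_6s(zapa_end=1000, eve_start=1050)  # 49 nt gap
```

A positive result only shows that the locus has room for the gene; confirming the presence of ssrS would still require a sequence or structure-based search of the gap.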

In many species of γ-proteobacteria and some β-proteobacteria, the enzyme FAU1/MFTHFS, also known as YgfA, a putative 5-formyltetrahydrofolate cyclo-ligase, is encoded between ZapB→ZapA→SsrS and the EVE protein (Fig. 5). In γ-proteobacteria, another directional gene array is frequently found adjacent to this predicted operon in a head-to-head orientation, with the potential for the promoter regions to overlap. It encodes an uncharacterized conserved protein (YgfB), an Xaa-Pro aminopeptidase (PepP), and a homolog of 2-octaprenyl-6-methoxyphenol 4-hydroxylase (UbiH), an FAD-dependent oxidoreductase, as well as a homolog of 2-octaprenylphenol 6-hydroxylase (UbiI), both of which are involved in ubiquinone biosynthesis (Fig. 5) [36]. Many of the γ-proteobacterial neighborhoods additionally include genes encoding homologs of ribose-5-phosphate isomerase (RpiA) and l-threonine dehydratase (IlvA).

In β-proteobacteria, the gene coding for the EVE protein is often followed by a gene encoding the ortholog of the TauE sulfite export protein (Fig. 5). In Burkholderiaceae and Neisseriales, a cobalamin (vitamin B-12) biosynthetic cluster is often found immediately adjacent to the ZapAB→SsrS→(FAU1)→EVE unit (Fig. 5). In Burkholderiales, a conserved region encoding a cytochrome c551/c552 family protein, dihydroxy-acid dehydratase (IlvD), a putative transcriptional regulator related to LysR, and prolipoprotein diacylglyceryl transferase (LGT) is adjacent to the ZapAB→SsrS→EVE putative operon.

The recently updated phylogeny of the β- and γ-proteobacteria [37] allows some inferences to be made concerning the evolutionary history of the predicted functional systems containing the EVE domain. The taxonomic distribution of the ZapAB→SsrS→(FAU1)→EVE unit covers β-proteobacteria, several early branching members of γ-proteobacteria (Xanthomonadales, Chromatiales, Methylococcales, etc.), and the clade primarily consisting of Pseudomonadales and Oceanospirillales. This broad taxonomic representation implies that the unit was present in the common ancestor of β- and γ-proteobacteria. The VAAP clade (Vibrionales, Alteromonadales, Aeromonadales, and Pasteurellales) has lost this association, and each order, with the exception of Aeromonadales, possesses distinct conserved regions neighboring encoded EVE proteins (Fig. 12, Additional file 1: Figs. S2–4). In E. coli K-12, both of the typical EVE-associated γ-proteobacterial operons and their orientations are conserved, but the ZapB and EVE domain proteins have been lost (Fig. 5).

In agreement with the CLANS results, in the phylogenetic tree of the EVE domains, the EVE proteins from most of the higher plants branch from within the α-proteobacterial clade, whereas EVE proteins from the metazoa, fungi, and some plants are more similar to γ-proteobacterial domains, but lie outside of the γ-proteobacterial variation (Additional file 1: Fig. S5). Due to the small size of the EVE domain, phylogenetic analysis cannot confidently identify the prokaryotic ancestry of these domains in eukaryotes, although Proteobacteria are the most likely contributors, with possible multiple acquisitions.

EVE domains in putative ligand-activated antibiotic resistance and other ligand-activated responses

The largest of the almost exclusively prokaryotic clusters from our CLANS analysis (green in Fig. 2) was populated predominantly by domains encoded in the operonic context of a transcription factor and a small molecule ligand-binding domain (Fig. 6). The most frequent putative operons encoded an EVE domain with either a MarR (multiple antibiotic resistance) family transcription factor or a YafY family transcription factor. YafY-like factors are a fusion of a putative DNA-binding HTH domain and a WYL domain, a ligand-binding regulator of prokaryotic defense systems [38,39,40]. The MarR-EVE and YafY-EVE pairs are further associated, most frequently, with a ligand-binding domain of the SPRBCC (START/RHO_alpha_C/PITP/Bet_v1/CoxG/CalC) or EhpR (phenazine antibiotic resistance) families. EhpR family proteins contain a vicinal oxygen chelate (VOC) domain, and other VOC domain homologs are also frequently encoded in the neighborhoods of this class of EVE proteins, often replacing SPRBCC domains in association with MarR-EVE pairs (Fig. 6). The sequences of this group of EVE domains formed a distinct clade in our phylogenetic analysis (Additional file 1: Fig. S5). The regions surrounding these apparent 3-component systems are highly diverse. They include putative defense functions in Nocardia and related genera, where multiple paralogs of UvrD-like helicase domains fused to Cas4-like PD(D/E)XK phosphodiesterases [41] are present (Additional file 1: Fig. S6). Conversely, in Azospirillum, the neighborhoods include translation factor genes and cytochrome c biosynthesis operons, a context that is, surprisingly, closely similar to those of the distinct classes of EVE proteins in the two largest clusters in our CLANS analysis (Additional file 1: Fig. S7).

EVE as a specificity domain in modification-dependent restriction systems

The EVE proteins in Proteobacteria and eukaryotes found in the two largest clusters we observed with CLANS analysis show high levels of sequence conservation. By contrast, many genome defense systems encompass EVE domains with more pronounced sequence diversity. These domains range from highly significant matches to hits with weaker similarity, and many could be detected only with sensitive methods such as HHpred. A substantial variety of putative modification-dependent (type IV) restriction endonucleases (REs) with core architectures of EVE-PD(D/E)XK phosphodiesterase and EVE-HNH endonuclease were identified in our searches (Fig. 7). Furthermore, we identified numerous proteins containing fusions of the EVE domain with nucleases of the phospholipase D (PLDc) or GIY-YIG superfamilies (Fig. 7). Rare fusions to homologs of the glucosylated 5hmC-dependent RE GmrSD were detected as well.

The EVE domain is also frequently incorporated into homologs of the GTP-dependent DNA translocase McrB. In E. coli K-12, McrB, in concert with McrC, a PD(D/E)XK-type nuclease that interacts with McrB hexamers via its N-terminal domain, restricts N⁴-methylcytosine (4mC)/5mC/5hmC-containing DNA; in this strain, EVE is replaced with a DUF3578 family domain as the specificity module [20, 42,43,44]. Overall, the EVE-McrB combination is the most common domain architecture among the EVE-containing proteins in defense systems, represented in nearly 300 bacterial and archaeal genera, and is particularly abundant among Firmicutes and Bacteroidetes (Fig. 7).

In diverse archaea, a recurrent partnership was observed between standalone EVE domain proteins and a predicted, uncharacterized restriction system that encodes a SWI2/SNF2 helicase fused to a nuclease (PD(D/E)XK or PLDc family). This gene is expressed in an operon that also encodes a methyltransferase of COG1743 and an uncharacterized DUF499-containing protein (Additional file 1: Fig. S8). Our analysis showed that DUF499 is homologous to CDC6/ORC1 ATPases, which are involved in the recognition of the origin of DNA replication in archaea and eukaryotes [45, 46].

EVE domains in toxin-antitoxin systems

A major class of EVE proteins which formed a distinct cluster in our CLANS analysis (purple in Fig. 2) is a fusion of EVE to the C-terminus of a GNAT-like acetyltransferase, often with a PIN RNase domain at the N-terminus (Fig. 7). GNAT and PIN domains both frequently function as toxins [47,48,49]. This variety of EVE proteins has been described previously in some detail by Iyer et al., who proposed that these proteins acetylate a DNA base, although the frequent presence of a PIN domain suggests that these systems employ RNA as a target or guide [15]. As also addressed in that study, almost all (PIN)-GNAT-EVE operons encode a protein containing a second PUA-like domain, ASCH, and often, also, an AAA+ ATPase of the AAA_17 family. In some cases, mostly in α-proteobacteria, the ASCH domain is fused to a helix-turn-helix (HTH) DNA-binding domain of the xenobiotic response element (XRE) family.

The distributions of the PIN-GNAT-EVE and the GNAT-EVE fusion proteins among prokaryotes are notably different (Fig. 8). The PIN-GNAT-EVE proteins are frequently found in bacterial and archaeal genomes in a close association with type I restriction-modification (RM) systems (HsdR/M/S operons). A consistent proximity between PIN-GNAT-EVE proteins and other types of defense systems, such as CRISPR-Cas and the COG1743→DUF499→SWI2/SNF2 helicase-nuclease operon described above, was also observed (Fig. 8). By contrast, GNAT-EVE proteins are not associated with type I RM systems but are commonly located within prophages in β- and γ-proteobacteria (Fig. 8).

In addition to the profusion of putative TA systems containing EVE domains, EVE is also regularly found as a standalone protein closely associated with type I RM systems (Fig. 10). These systems often also contain an ASCH domain and are mostly found in archaea. In effect, RM systems exhibit toxin-antitoxin functionality, with the restriction endonuclease playing the role of toxin, whereas the methyltransferase is its antitoxin [47, 50,51,52,53]. Accordingly, the EVE domains are likely to play similar roles in these systems, namely, targeting the toxins (including restriction endonucleases) to modified nucleic acids.

MmcQ/YjbR-EVE fusion proteins

Related to the RM and TA system-associated EVE proteins is a class of MmcQ/YjbR-EVE fusions that we found associated with a number of defense gene clusters, as well as signaling, transport, and metabolic factors, mostly, in Firmicutes and γ-proteobacteria (Fig. 7). MmcQ/YjbR (PF04237) has a CyaY-like fold and is also fused to tellurite resistance protein TerB and GNAT-type acetyltransferases in other contexts [54]. MmcQ/YjbR-EVE fusions also frequently contain an N-terminal DUF1831 domain, and in many cases, where this domain is missing, there is a DUF1831-MmcQ/YjbR gene immediately adjacent to MmcQ/YjbR-EVE.

DUF1831-MmcQ/YjbR-EVE fusions, which are the most numerous in our data, are frequently encoded within a genomic context that includes sensor histidine kinases, response regulators, and putative DNA-binding proteins. They are also often associated with ABC-type transport system components. Intriguingly, the large number of currently available Streptococcus genomes enabled the detection of highly variable regions adjacent to the genes encoding DUF1831-MmcQ/YjbR-EVE proteins in conserved positions. These areas often contain mobile genetic elements (MGEs), defense-associated genes (TA modules, CRISPR-Cas systems), as well as uncharacterized, putative defense, transport, secretory, and DNA/protein repair genes (a MsrAB/disulfide interchange factor operon we detected is likely a mobile protein repair system) [55] (Fig. 9). These hotspots for integration (and presumably contraction) adjacent to (DUF1831)-MmcQ/YjbR-EVE genes often include transposases, implying a transposon-type mechanism of mobilization. When these variable gene arrays are large, ancestral, independent mobile modules that were assembled to give rise to them can be predicted by comparison with genomes in which the array is smaller (Fig. 9). Further work will be necessary to establish the relationship in these systems between the mobile genes and those conserved at the borders, including DUF1831-MmcQ/YjbR-EVE. The fusion of MmcQ/YjbR-EVE to a transposase in Streptococcus lutetiensis further underscores that this variety of EVE protein might play a role in regulating the acquisition and/or expression of MGEs. We also observed a similar phenomenon in the regions neighboring MmcQ/YjbR-EVE genes in Actinobacillus (Additional file 1: Fig. S4).

DCD, an EVE-like domain involved in restriction of modified DNA and PCD in plants

We further identified the Development and Cell Death (DCD) domain as a specificity module comparable in sequence and genomic context to EVE. The DCD domain is rare in prokaryotes and, mostly, is present in archaea and hyperthermophilic bacteria. The DCD domain was originally identified in proteins that are strongly induced during plant development, the hypersensitive response to avirulent pathogens, and reaction to various environmental stresses in plants [56,57,58]. Although not classified as such previously, we conclude that DCD is a member of the PUA-like superfamily due to the limited but significant sequence similarity with EVE detected by profile-profile comparison using HHpred (97.16% probability, E-value 0.049). Several of the most highly conserved residues of the EVE domains are present in the DCD domains, and the characteristic secondary structure (βαβαββββ) that forms the EVE β-barrel is also predicted for DCD (Additional file 1: Fig. S9). The DCD domain shows some associations similar to those of defense-related EVE domains, in particular, with type I restriction systems, as well as a fusion to PD(D/E)XK phosphodiesterases and McrB-like domains, and is distinguished by frequent fusion to a Rossmann-fold methyltransferase, which is extremely rare among EVE domains (Fig. 10). These connections imply that, similarly to EVE, DCD domains in prokaryotes recognize methylated bases in DNA and thus contribute to restriction of modified DNA. DCD-methyltransferase fusion protein genes are usually followed by a gene encoding a PD(D/E)XK nuclease, suggesting that they are involved in the additional methylation of modified DNA, recognized by DCD, that could be restricted in the absence of the supplementary methylation.

DCD and YTH: EVE-like domains with roles in modification-dependent DNA restriction systems and eukaryotic modification-based mRNA processing

We performed a comprehensive search for the DCD domain in all available genomes and found that, among eukaryotes, it is not restricted to plants, as originally described, but is also present in many chromist genomes, particularly, in heterokont and haptophyte algal proteins, where it is often fused to another EVE-like domain, YTH (Fig. 11a). The YTH domain is broadly distributed in eukaryotes, has been consistently reported to bind m⁶A in eukaryotic mRNAs, and is involved in multiple processes including splicing and polyadenylation, translation/decay balance (notably triaging of mRNA translation during stress), and inhibition of viral RNA replication [19, 60,61,62,63]. When fused to the YTH domain in eukaryotes, the DCD domain is also fused to a KH (K homology) domain, and an array of CCCH-type zinc finger (Znf) domains (Fig. 11a). Similar repeated Znfs are conserved in mRNA cleavage and polyadenylation specificity factor 30 (CPSF30) family proteins that are involved in eukaryotic mRNA maturation (Fig. 11a) [64, 65]. CPSF30-like proteins in plants also contain a YTH domain and are orthologous to the Znf-Znf-Znf-YTH-DCD-KH proteins we detected. CPSF30 orthologs in fungi and metazoans only have Znf domains, but YTH domain proteins are integral to the CPSF complexes in vertebrates, where they interact with CPSF6 (Fig. 11a) [66].

YTH also has been reported to bind m⁶A in DNA when fused to an McrB homolog in the archaeon Thermococcus gammatolerans [20]. Using the sequence of this archaeal YTH domain as a PSI-BLAST query, we detected homologs that, much like DCD, are fused to McrB-like GTPases or PD(D/E)XK nucleases. Most of these YTH-like domains are not clearly distinguishable from EVE domains using HHpred, being modest hits for both types, a pattern that is reminiscent of some prokaryotic DCD domains.

Extended Tudor-DCD fusion proteins in choanoflagellates implicated in the origins of the piRNA pathway

We were unable to identify DCD domains in metazoans. However, when we analyzed the predicted proteins translated from the published transcriptomes of choanoflagellates, the closest unicellular relatives of animals [67, 68], a protein containing a DCD domain fused to an extended Tudor (eTudor) domain was detected in both loricate and non-loricate choanoflagellates, the two main lineages of this phylum (Figs. 11c and 13).

Tudor domains bind post-translationally methylated arginine or lysine residues in eukaryotic proteins [69]. They interact with three main types of modified proteins: histone tails (methylarginine or methyl-lysine), Sm proteins in spliceosomes (methylarginine), and the N-termini of metazoan Piwi-related Argonaute proteins (methylarginine) [69]. The Sm protein-binding Tudor domains present in the splicing factor survival motor neuron (SMN) and related proteins are distinguished from Tudor domains that bind histone tails by an N-terminal α-helix (Fig. 11b) [70]. The Tudor domains that interact with Piwi-related Argonautes are of the eTudor type [59, 69]. The eTudor family is restricted to metazoans, with the exception of Tudor-SN, a highly conserved eukaryotic protein implicated in RNA interference, splicing, microRNA decay, and RNA editing that contains four staphylococcal nuclease (SNase) domains and a single eTudor domain [71,72,73] (Fig. 11b).

Bioinformatic and structural analyses suggest that the eTudor domain arose when a Tudor domain, related to the Tudor domain in SMN, inserted into the fifth, C-terminal SNase domain of an ancestral multi-SNase protein [59, 70] (Fig. 11b). The resulting domain fusion of Tudor and SNase (hence the name “extended Tudor”) became the ancestor of the eTudor family, in which the catalytic residues from the ancestral SNase domain are mutated, likely rendering it inactive [59, 70, 71, 74] (Fig. 11b). Present in all metazoans, multi-eTudor proteins play crucial roles in the localization of Piwi-related Argonautes and biogenesis of Piwi-interacting RNAs (piRNAs) by interacting with symmetrically dimethylated arginine (SDMA) residues in the Argonaute N-termini, and thus, are essential for repression of transposable elements, modulation of germline mRNA levels, and germ/stem cell immortality [69, 75, 76]. The origins of the complex metazoan multi-eTudor proteins derived from Tudor-SN are fundamental to the understanding of the piRNA pathway and animal germline specification but, currently, remain obscure.

We detected eTudor proteins in choanoflagellates that are not orthologs of Tudor-SN, and these represent the first examples, to our knowledge, to be reported in a non-metazoan organism. These proteins usually also contain a DCD domain, and in some cases, a CAPAM (cap-specific adenosine methyltransferase)-like methyltransferase and/or a second eTudor domain (Figs. 11c and 13). In the process of identifying the CAPAM-like domains, we encountered a misannotation of the Pfam family PCIF1_WW (pfam12237), which, according to our analysis, is not a WW domain, but rather a CAPAM-like methyltransferase. We also observed multi-eTudor proteins fused with a YTH domain in some species of coral, among the earliest branching metazoans (Figs. 11c and 13).

Furthermore, we detected links between the eTudor-DCD/YTH fusion proteins and the ubiquitination pathway and protein degradation. An N-terminal ubiquitin-binding domain (UBA) and B-box Znf domains are present in the choanoflagellate eTudor-DCD and coral eTudor-YTH proteins, respectively (Fig. 13). Similar B-box Znfs are found in TRIM ubiquitin E3 ligases and in the eTudor piRNA pathway factor qin/komo from Drosophila, which also contain RING Znfs (Fig. 13) [69, 77]. The eTudor proteins with RING Znfs are conserved throughout the eumetazoa, although in vertebrates, the B-box Znfs appear to have been lost. In the sponge Amphimedon queenslandica, a protein with four eTudor domains and an N-terminal MYND-type Znf has been identified, with orthologs present in most metazoans (Fig. 13).

Biochemical functions of EVE-associated proteins

In this section, we present our inferences of the likely biochemical functions of the proteins linked to the EVE domain, both covalently and non-covalently, which we derived from the literature documenting experimental characterization of members of the corresponding protein families. An important caveat is that these inferences, although often direct and likely valid, are inherently less confident than the robust computational results so far described.