Dynamic gain and loss reveals an unstable genomic region
We found orthologues of Argfx, Dprx, Leutx, Tprx1 and Tprx2 in diverse species from the main eutherian branches – Boreoeutheria (comprising Euarchontoglires and Laurasiatheria) and Atlantogenata (comprising Afrotheria and Xenarthra) – but not in monotremes or marsupials (Fig. 1). In human, Tprx1 and Tprx2 flank the Crx gene on chromosome 19, Dprx and Leutx are more distant on chromosome 19, and Argfx is on chromosome 3. In each eutherian mammal genome examined, the genes are located in syntenic positions defining orthology relationships (Additional file 1: Figure S1). We also uncover a previously undescribed locus we term Pargfx (Parent of Argfx) in several carnivores (dog, cat and ferret) and odd-toed ungulates (rhinoceros and horse), adjacent to Dprx (Fig. 1; Additional file 1: Figure S1; Additional file 2: Figure S2); a retroposed copy is present in primate genomes. These data indicate that Argfx, Dprx, Leutx, Pargfx and Tprx are eutherian-specific homeobox gene families that originated after the divergence of the placental and marsupial lineages.
Lability did not cease after divergence of the major mammalian lineages. We detect examples of additional gene gain, including an extra Leutx in elephant and hyrax, 14 Leutx copies in the guinea pig genome, at least 8 Tprx duplicates in horse and 10 in bat, and an extra Tprx gene shared by cows and pigs (Figs. 1 and 2; Additional file 3: Table S1). The cow/pig Tprx3 locus is unusual in being located between 4.5 and 8 Mb away from Tprx1 and Tprx2, within the leukocyte receptor complex – a cluster of immunoglobulin-like receptor genes [22].
We also detect many cases of gene inactivation and loss. All species examined have lost one or more of the eutherian-specific homeobox gene families; for example, humans have lost Pargfx. For both Dprx and Pargfx, degeneration has occurred repeatedly on at least three occasions within mammals (Fig. 1; Additional file 3: Table S1). The most extreme loss is seen in mouse and rat, where Argfx, Dprx, Leutx and Pargfx are all missing.
Tprx1 and Trpx2 are not readily found in mouse and rat, but in this case, our analyses reveal the explanation is not gene loss but extensive sequence divergence, contrary to previous reports [12, 23]. The mouse genome has a highly divergent double-homeobox gene Crxos in the position syntenic to Tprx1, and a massive expansion of approximately 67 tandem loci (including pseudogenes) of the murid-specific Obox genes in the position syntenic to Tprx2. The extreme sequence differences between Crxos and Tprx1, and between Obox genes and Tprx2, led to the proposal that Crxos and Obox genes arose independently in rodents, after loss of Tprx1 and Tprx2 [12, 23]. To test if Crxos and Obox loci are cryptic orthologues of Tprx genes, we examined genomes of other rodent species. We found that squirrels and ctenohystricans (naked mole-rat and guinea pig) have a complement of PRD class genes similar to non-rodent eutherian mammals. Furthermore, inclusion of genes from guinea pig and naked mole-rat into phylogenetic analysis broke the long branches to Crxos and Obox genes, and gave high support for their orthology with Tprx1 and Tprx2, respectively (Additional file 4: Figure S3). In several species, evolutionary relationships are further disguised by apparent gene conversion between Tprx1 and Tprx2.
The generation of a new set of homeobox genes in eutherian mammals is highly unusual. The continued duplication of these genes, mirrored by extensive and recurrent gene loss, combine to reveal an exceptionally dynamic region of the eutherian genome (syntenic to part of human chromosome 19), expanding and contracting to spawn and delete new homeobox genes at a high rate.
Crx gave rise to Argfx, Dprx, Leutx, Pargfx and Tprx by asymmetric evolution
Analysis using homeodomain sequences does not unambiguously reveal the progenitor gene for Argfx, Dprx, Leutx, Pargfx and Tprx genes from within the PRD class [1]. However, we discovered that we could expand the length of phylogenetically informative sequence by including amino acid stretches C-terminal to the homeodomain conserved between Argfx, Leutx, Pargfx and Otx proteins, corresponding to the Otx-specific domain [24] (Additional file 2: Figure S2). Phylogenetic analysis using this alignable region plus the homeodomain (Additional file 4: Figure S3) revealed that the eutherian Crx gene is the sister (and progenitor) to Argfx, Leutx and Pargfx (Fig. 3a). Although we did not detect the same motif in Dprx or Tprx genes, the tight physical linkage to Pargfx or Crx, respectively, plus the PRD class homeodomain assignment make it highly likely that these genes are also divergent cryptic paralogues of Crx. Since duplicated loci can sometimes retain similar conserved non-coding elements (CNEs) [25, 26], we also turned to non-coding DNA for additional evidence of ancestry. We divided the human TPRX1-CRX-TPRX2 genomic region into fragments and performed VISTA comparisons between them using CRX as a reference (Fig. 3b). This revealed five copies of a duplicated CNE, all of which could also be detected in other placental mammal species; non-eutherian species contain only a single copy of the CNE adjacent to Crx (Fig. 3b; Additional file 2: Figure S2). Of the five copies in human, one is located downstream to each of TPRX1, TPRX2 and CRX, in the same relative orientation, giving strong evidence that Tprx genes are derived from Crx by tandem duplication and asymmetric divergence (Fig. 3b). Interestingly, the additional two CNE copies are more distant and not associated with homeobox loci, and are most likely remnants of two further ‘ghost’ Crx tandem duplicates, lost early in mammalian evolution as part of the dynamic evolution elucidated above.
Functional constraints in divergent homeodomains
The extreme sequence divergence of Argfx, Dprx, Leutx, Pargfx and Tprx gene from their highly conserved progenitor, Crx, together with the prevalence of gene loss and pseudogenes, raises questions about whether they are functional. We also note that sequence differences between species, inferred by phylogenetic branch lengths, are higher than for other homeobox genes (Fig. 3a; Additional file 4: Figure S3). To assess this quantitatively, we used phastCons, a phylogenetic hidden Markov model-based method that estimates the probability that each nucleotide is part of a conserved element, regardless of coding potential [27]. Applied to Leutx, Tprx1 and Tprx2, this identified just 13 to 19 nucleotide positions with > 50 % probability of belonging to a conserved element; Argfx and Dprx showed more apparent conservation with 95 and 279 positions. This contrasts to over 600 positions for the progenitor gene Crx. Nonetheless, for each gene, the highest similarity and conservation probability between eutherian species is consistently the third alpha helix of the homeodomain, suggestive of selective pressures to retain DNA-binding (Fig. 4a).
Specificity of expression can also be a clue to functionality. To assess this, we first built improved gene models for human ARGFX, DPRX, LEUTX, TPRX1 and TPRX2, refining intron/exon boundaries using sequence alignment and transcriptome data (Additional file 2: Figure S2). After integrating these into a human reference gene dataset, we mapped available RNAseq reads from human tissues and developmental stages to the human genome, and plotted heatmaps of expression (Fig. 4b). This revealed tightly regulated temporal expression profiles for each gene restricted to pre-blastocyst stages. ARGFX and TPRX1 are activated with embryonic genome activation at the 8-cell stage, with lower level DPRX and LEUTX transcription initiated slightly earlier. TPRX2 has the lowest expression but mirroring TPRX1. The expression level for all five genes (ARGFX, DPRX, LEUTX, TPRX1 and TPRX2) crashes after the morula stage, and the genes are not strongly expressed again in embryonic, foetal or adult tissues. These expression profiles are tighter and developmentally earlier than the classical pluripotent stem cell markers POU5F1, NANOG, KLF4 and SOX2, which are expressed in the blastocyst and in hESCs (Fig. 4b). The expression profiles are also different to the human orthologue of the progenitor gene, CRX. To examine if the expression patterns differ between species, we also analysed RNAseq data from cow embryos and adult tissues [28, 29]. This revealed that bovine Leutx, Tprx1, Tprx2 and Argfx gene expression peaks sharply at the 8-cell to 16-cell stages, very similar to expression of their human orthologues (Fig. 4c). Dprx is a pseudogene in cow. As noted above, the orthologues of Tprx genes are extensively duplicated and highly divergent in mice, but even so some of these are expressed primarily between the 2-cell or 4-cell stage and morula [30, 31], with transcripts from some loci in oocytes, later embryogenesis and placenta [32, 33]. The finding that the Argfx, Dprx, Leutx and Tprx genes are phylogenetically related to each other, eutherian-specific and expressed primarily in pre-blastocyst totipotent stages of human, cow and mouse development suggests a collective name is helpful. We denote these genes, plus Pargfx, the ETCHbox (Eutherian Totipotent Cell Homeobox) genes.
Ectopic expression of ETCHbox genes reveals differential transcriptional activities
Since the human embryonic stages expressing ETCHbox genes are not readily amenable to experimental analysis, we deployed ectopic expression to gain insight into functions. The mouse system is not appropriate since mouse has lost most ETCHbox genes while extensively duplicating Tprx genes. Based on the refined gene models described above, we used a combination of exon-specific genomic PCR and gene synthesis to build constitutive expression constructs for the four most highly-expressed human ETCHbox genes: ARGFX, DPRX, LEUTX and TPRX1. Reading frames were C-terminal tagged with V5 to facilitate protein detection, and constructs transfected into primary human fibroblasts. Western blot analysis confirmed that each construct generated a full-length protein (Additional file 5: Figure S4). Immunocytochemistry revealed all four proteins localise to the nucleus, but with subtly different subcellular patterns. ARGFX and TPRX1 proteins were clearly localised to the nucleus, apart from nucleoli; DPRX protein was strongest in the nucleus but also in cytoplasm; LEUTX was nuclear-localised but with evidence of disruption to cell integrity (Fig. 5a). The significance of the subtle differences is unclear; the predominant nuclear location is as expected for transcription factors. Nuclear localisation was also found using specific antibodies to ARGFX and DPRX proteins expressed in HeLa cells from untagged constructs (not shown).
We used transcriptome analysis to assess if ectopic expression in fibroblasts caused activation or inhibition of specific genes and genetic pathways. After culture for 48 h, RNAseq in triplicate samples was performed using the Illumina platform to detect up- and down-regulated genes and pathways compared to control transfected cells. Differential biological effects of each ETCHbox gene are evident when lists of significantly up- and down-regulated genes are compiled and compared between treatments, with DPRX expression causing a smaller change to the total transcriptome (Fig. 5b). This revealed between 1143 and 1839 genes significantly up- or down-regulated by ARGFX, LEUTX or TPRX1, of which 250–754 had a fold-change of > 1.25 (equivalent to 2-fold per transfected cell). There is very strong overlap between gene sets down-regulated by LEUTX and TPRX1 (Fisher’s exact test P = 0) and significant overlap between their up-regulated genes (P = 10–73). ARGFX has a large number of gene-specific effects. These experiments verify transcriptional activity of the ETCHbox proteins and suggest partial overlap of function between LEUTX and TPRX1.
Relevance of transcriptional activities to human embryonic development
To investigate if downstream genes detected in cell culture are functionally relevant to early human development, we deployed a temporal clustering approach. First, we determined the temporal expression profiles for all genes expressed between the oocyte and blastocyst stages of human development. A gene expression clustering approach identified 106 distinct expression profiles. Second, we tested whether any of these profiles are significantly enriched in genes up- or down-regulated by transfection of ARGFX, LEUTX or TPRX1, determined above. DPRX was not considered due to the smaller number of specific transcriptional effects elicited by transfection.
We detect a strong and striking enrichment of ARGFX, LEUTX and TPRX1 responsive genes within human expression profile 27 (Fig. 6a,b). This set of 50 human genes has low or zero expression in the oocyte, zygote, 2-cell and 4-cell stages, then a sharp transition to high expression at the 8-cell and morula stages, and a rapid drop in expression before the blastocyst stage; these genes therefore have a pulse of gene expression around genome activation and prior to cell fate determination. This profile differs from several others that show an increase of expression at genome activation but distinct patterns of subsequent expression. The only other signal detected is slight enrichment for LEUTX up-regulated genes in profile 40 (χ2
P = 0.02). It is notable that, for ARGFX, it is the set of genes that are up-regulated by transfection in cultured cells that is strongly enriched within profile 27 (P = 2 × 10–7), while for LEUTX and TPRX1 enrichment is detected for genes down-regulated by transfection (P = 5 × 10–4). Comparison of genes underlying the enrichment in each case reveals extensive overlap (Additional file 6: Figure S5). To examine if expression of putative targets is conserved in other mammals, we examined their bovine orthologues. The 50 human genes have 33 bovine orthologues with variable expression; clustering revealed five temporal profiles. Three profiles, containing 58 % of comparable genes (19/33), have expression profiles very similar to human profile 27, with a clear peak of expression at the 8-cell to 16-cell stages (Additional file 7: Figure S6).
The discovery that a set of genes in profile 27 responds to expression of human LEUTX and TPRX1 in the opposite direction to their response to ARGFX prompted us to test if this relationship extends to the full dataset of responsive genes. Comparison of all human targets down-regulated by TPRX1 and by LEUTX (not just those in profile 27) with all genes up-regulated by ARGFX, reveals over 100 target genes in common (Fig. 6c); ARGFX-up/LEUTX-down overlap P = 1.2 × 10–116, ARGFX-up/TPRX1-down P = 9.3 × 10–122 (Fisher’s exact test). No other combination of three datasets reveals a similar extent of overlap (Additional file 8: Figure S7). These analyses suggest that human LEUTX and TPRX1 genes act on a similar set of genes and genetic pathways between the 4-cell stage and the blastocyst, in an antagonistic manner to the effects of ARGFX. In several cases, ARGFX is causing an increase in the transcription of genes that specifically rise sharply in expression before the 8-cell stage, whereas LEUTX and TPRX1 down-regulate these same genes, which, in the embryo, drop dramatically in expression between morula and blastocyst. Together, we suggest that LEUTX, TPRX1 and ARGFX transcription factors modulate the precise on/off ‘pulse’ of expression that characterises profile 27, at the time when the very first cell fate decisions are made in the early mammalian embryo, leading to specification of embryonic and extra-embryonic cell lineages.