Mutation of amphioxus Pdx and Cdx demonstrates conserved roles for ParaHox genes in gut, anus and tail patterning

Background The homeobox genes Pdx and Cdx are widespread across the animal kingdom and part of the small ParaHox gene cluster. Gene expression patterns suggest ancient roles for Pdx and Cdx in patterning the through-gut of bilaterian animals although functional data are available for few lineages. To examine evolutionary conservation of Pdx and Cdx gene functions, we focus on amphioxus, small marine animals that occupy a pivotal position in chordate evolution and in which ParaHox gene clustering was first reported. Results Using transcription activator-like effector nucleases (TALENs), we engineer frameshift mutations in the Pdx and Cdx genes of the amphioxus Branchiostoma floridae and establish mutant lines. Homozygous Pdx mutants have a defect in amphioxus endoderm, manifest as loss of a midgut region expressing endogenous GFP. The anus fails to open in homozygous Cdx mutants, which also have defects in posterior body extension and epidermal tail fin development. Treatment with an inverse agonist of retinoic acid (RA) signalling partially rescues the axial and tail fin phenotypes indicating they are caused by increased RA signalling. Gene expression analyses and luciferase assays suggest that posterior RA levels are kept low in wild type animals by a likely direct transcriptional regulation of a Cyp26 gene by Cdx. Transcriptome analysis reveals extensive gene expression changes in mutants, with a disproportionate effect of Pdx and Cdx on gut-enriched genes and a colinear-like effect of Cdx on Hox genes. Conclusions These data reveal that amphioxus Pdx and Cdx have roles in specifying middle and posterior cell fates in the endoderm of the gut, roles that likely date to the origin of Bilateria. This conclusion is consistent with these two ParaHox genes playing a role in the origin of the bilaterian through-gut with a distinct anus, morphological innovations that contributed to ecological change in the Cambrian. In addition, we find that amphioxus Cdx promotes body axis extension through a molecular mechanism conserved with vertebrates. The axial extension role for Cdx dates back at least to the origin of Chordata and may have facilitated the evolution of the post-anal tail and active locomotion in chordates.


SECTION 1: GENE STRUCTURES OF AMPHIOXUS PDX AND CDX GENES (a) Verification of Pdx gene structure
The Branchiostoma floridae Pdx (Xlox) gene is comprised of two coding exons, confirmed here by aligning an assembled Pdx transcript sequence to the previously reported sequence of PAC clone 33B4 (GenBank AC129948.3; [2]); Figure S1. The B. floridae Cdx gene is generally depicted as comprised of two coding exons. However, we assembled a B. floridae Cdx transcript that aligned to three regions of the sequenced genomic PAC clone 36D2 (NCBI GenBank AC129947.4; [2]); Figure S2. We propose the small second exon is used differentially and note it is also present in predicted isoforms X1 and X2 (GenBank XP_019625318.1 and XP_019625319.1), but not X3 (XP_019625325.1), of B. belcheri Cdx.

SECTION 2: GENERATION OF MUTATIONS IN AMPHIOXUS PDX AND CDX GENES
TALEN sequences used to target exon 1 of B. floridae Pdx and Cdx genes. For mutagenesis using TALENS, two in vitro transcribed RNAs are injected for each gene; each mRNA includes a region of Repeat Variable Di-residues (RVDs) encoding a sequence specific DNA-binding peptide, coupled to the catalytic domain of FokI nuclease. If two RVDs flank a site of interest, dimerization activates FokI nuclease activity, introducing DNA breaks leading to deletion mutations. Figures S3 to S6 give the sequences of the four in vitro transcribed RNAs used, from the T3 RNA polymerase binding site to the restriction enzyme site used for plasmid linearization before in vitro transcription. Figure S3: Pdx forward TALEN. Red highlight = T3 promoter; blue text = 5' and 3' untranslated regions; red text = open reading frame for TALEN peptide sequence; red underlined text = sequence encoding RVDs; green highlight = SacI linearization site. AATTAACCCTCACTAAAGGGAAGCTTGCTTGTTCTTTTTGCAGAAGCTCAGAATAAACGCTCAACTTTGGCAGATCTAACTCGAGAAA  GATATTGTATATATCGTAACAATAGGAGGTTCAACAATGGCTTCCTCCCCTCCAAAGAAAAAGAGAAAGGTTAGTTGGAAGGACGCAA  GTGGTTGGTCTAGAGTGGATCTACGCACGCTCGGCTACAGTCAGCAGCAGCAAGAGAAGATCAAACCGAAGGTGCGTTCGACAGTGGC  GCAGCACCACGAGGCACTGGTGGGCCATGGGTTTACACACGCGCACATCGTTGCGCTCAGCCAACACCCGGCAGCGTTAGGGACCGTC  GCTGTCACGTATCAGCACATAATCACGGCGTTGCCAGAGGCGACACACGAAGACATCGTTGGCGTCGGCAAACAGTGGTCCGGCGCAC  GCGCCCTGGAGGCCTTGCTCACGGATGCGGGGGAGTTGAGAGGTCCGCCGTTACAGTTGGACACAGGCCAACTTGTGAAGATTGCAAA  ACGTGGCGGCGTGACCGCAATGGAGGCAGTGCATGCATCGCGCAATGCACTGACGGGTGCCCCCCTGAACCTGACCCCGGACCAAGTG  GTGGCTATCGCCAGCAACGGTGGCGGCAAGCAAGCGCTCGAAACGGTGCAGCGGCTGTTGCCGGTGCTGTGCCAGGACCATGGCCTGA  CCCCGGACCAAGTGGTGGCTATCGCCAGCAACAATGGCGGCAAGCAAGCGCTCGAAACGGTGCAGCGGCTGTTGCCGGTGCTGTGCCA  GGACCATGGCCTGACTCCGGACCAAGTGGTGGCTATCGCCAGCCACGATGGCGGCAAGCAAGCGCTCGAAACGGTGCAGCGGCTGTTG  CCGGTGCTGTGCCAGGACCATGGCCTGACTCCGGACCAAGTGGTGGCTATCGCCAGCCACGATGGCGGCAAGCAAGCGCTCGAAACGG  TGCAGCGGCTGTTGCCGGTGCTGTGCCAGGACCATGGCCTGACTCCGGACCAAGTGGTGGCTATCGCCAGCCACGATGGCGGCAAGCA  AGCGCTCGAAACGGTGCAGCGGCTGTTGCCGGTGCTGTGCCAGGACCATGGCCTGACCCCGGACCAAGTGGTGGCTATCGCCAGCAAC  GGTGGCGGCAAGCAAGCGCTCGAAACGGTGCAGCGGCTGTTGCCGGTGCTGTGCCAGGACCATGGCCTGACCCCGGACCAAGTGGTGG  CTATCGCCAGCAACATTGGCGGCAAGCAAGCGCTCGAAACGGTGCAGCGGCTGTTGCCGGTGCTGTGCCAGGACCATGGCCTGACTCC  GGACCAAGTGGTGGCTATCGCCAGCCACGATGGCGGCAAGCAAGCGCTCGAAACGGTGCAGCGGCTGTTGCCGGTGCTGTGCCAGGAC  CATGGCCTGACTCCGGACCAAGTGGTGGCTATCGCCAGCCACGATGGCGGCAAGCAAGCGCTCGAAACGGTGCAGCGGCTGTTGCCGG  TGCTGTGCCAGGACCATGGCCTGACTCCGGACCAAGTGGTGGCTATCGCCAGCCACGATGGCGGCAAGCAAGCGCTCGAAACGGTGCA  GCGGCTGTTGCCGGTGCTGTGCCAGGACCATGGCCTGACCCCGGACCAAGTGGTGGCTATCGCCAGCAACAATGGCGGCAAGCAAGCG  CTCGAAACGGTGCAGCGGCTGTTGCCGGTGCTGTGCCAGGACCATGGCCTGACTCCGGACCAAGTGGTGGCTATCGCCAGCCACGATG  GCGGCAAGCAAGCGCTCGAAACGGTGCAGCGGCTGTTGCCGGTGCTGTGCCAGGACCATGGCCTGACTCCGGACCAAGTGGTGGCTAT  CGCCAGCCACGATGGCGGCAAGCAAGCGCTCGAAACGGTGCAGCGGCTGTTGCCGGTGCTGTGCCAGGACCATGGCCTGACCCCGGAC  CAAGTGGTGGCTATCGCCAGCAACAATGGCGGCAAGCAAGCGCTCGAAAGCATTGTGGCCCAGCTGAGCCGGCCTGATCCGGCGTTGG  CCGCGTTGACCAACGACCACCTCGTCGCCTTGGCCTGCCTCGGCGGACGTCCTGCCATGGATGCAGTGAAAAAGGGATTGCCGCACGC  GCCGGAATTGATCAGAAGAGTCAATCGCCGTATTGGCGAACGCACGTCCCATCGCGTTGCCTCTAGATCCCAGCTAGTGAAATCTGAA  TTGGAAGAGAAGAAATCTGAACTTAGACATAAATTGAAATATGTGCCACATGAATATATTGAATTGATTGAAATCGCAAGAAATTCAA  CTCAGGATAGAATCCTTGAAATGAAGGTGATGGAGTTCTTTATGAAGGTTTATGGTTATCGTGGTAAACATTTGGGTGGATCAAGGAA  ACCAGACGGAGCAATTTATACTGTCGGATCTCCTATTGATTACGGTGTGATCGTTGATACTAAGGCATATTCAGGAGGTTATAATCTT  CCAATTGGTCAAGCAGATGAAATGCAAAGATATGTCGAAGAGAATCAAACAAGAAACAAGCATATCAACCCTAATGAATGGTGGAAAG  TCTATCCATCTTCAGTAACAGAATTTAAGTTCTTGTTTGTGAGTGGTCATTTCAAAGGAAACTACAAAGCTCAGCTTACAAGATTGAA  TCATATCACTAATTGTAATGGAGCTGTTCTTAGTGTAGAAGAGCTTTTGATTGGTGGAGAAATGATTAAAGCTGGTACATTGACACTT  GAGGAAGTGAGAAGGAAATTTAATAACGGTGAGATAAACTTTTAATAGGCTAGTGACTGACTAGGATCTGGTTACCACTAAACCAGCC  TCAAGAACACCCGAATGGAGTCTCTAAGCTACATAATACCAACTTACACTTACAAAATGTTGTCCCCCAAAATGTAGCCATTCGTATC Figure S6: Cdx reverse TALEN. Red highlight = T3 promoter; blue text = 5' and 3' untranslated regions; red text = open reading frame for TALEN peptide sequence; red underlined text = sequence encoding RVDs; green highlight = SacI linearization site.

Pdx forward TALEN
Each TALEN pair is designed to flank a restriction endonuclease recognition site enabling mutation detection by digestion of a PCR amplified product (AflIII for Pdx, PasI for Cdx). Table 1 gives the mutation detection primers used.

Genes Primer sequences (5'→3') Pdx mutation detection
Forward: TTTCAAACGATACCGGACAAAC Reverse: CCACTGAGACTTCCAGGCGT Cdx mutation detection Forward: TACTGGTTTGTCACGGCGAG Reverse: CTGGGGGAGGACTTGTGGAGTA Table S1: Mutation detection primers B. floridae eggs were injected with pairs of TALEN RNAs, fertilized and embryos reared to neurula stage, using previously described methods [48]. PCR followed by restriction digestion revealed that both TALEN pairs introduced deletion mutations, with a higher frequency detected by the Cdx TALEN pair ( Figure S7). Figure S7: PCR analysis of embryos reared from injected eggs: Left hand gel: Pdx PCR products showing faint band of uncut amplification product (deletion mutants, arrow). Right hand gel: Cdx PCR products showing uncut amplification product (arrow).
The same PCR primers were used to track inheritance of the mutant alleles in adult amphioxus tail clips, in sperm or pools of embryos, and in single embryos after in situ hybridization, using previously described methods [48]. Sequencing of PCR products was also used to verify exact nature of the mutations.
The TALEN-generated Pdx mutants used in this study have deletions of 4 bp, 11 bp and 13 bp ( Figure  1, main text; Figure S8). The predicted mutant peptides are shown beneath the DNA sequences; no predicted product contains a homeodomain. Figure S8: Sites of 4 bp, 11 bp and 13 bp deletions in Pdx gene (red) and predicted protein products (out of frame amino acids underlined). For 4Δ and 11Δ we also identified substitution mutations generated next to the deletion (green). These are taken account of in predicting peptide sequence. The TALEN-generated Cdx mutant used in this study has a deletion of 7 bp (Figure 1, main text; Figure  S9). The predicted mutant peptide retains the first 5 amino acids before a frameshift, then 4 additional residues (underlined) before a premature stop codon. Figure S9: Site of 7 bp deletion in Cdx gene (red) and predicted protein product (out of frame amino acids underlined)

SECTION 6: ANALYSIS OF CYP26-3
We first tested for orthology between three Cyp26 genes from B. floridae and the three reported Cyp26 genes of B. lanceolatum [53].
Figure S14: DNA sequences of the deduced transcribed region of three Cyp26 genes from B. floridae (Bf). Blue text = untranslated regions (UTR), red text = coding sequence, underlined sequence = region used as in situ hybridisation probe for Cyp26-3. We identify five putative Cdx binding sites (TTTATT/AATAAA) in ~3.1 kb upstream of the ATG of Cyp26a-3 (blue highlights in Figure S16). Four are upstream of the deduced transcriptional start site (black text) and one in the 5' untranslated region (blue text). Two were chosen for mutagenesis (bold and underlined); these have a purine residue following TTTATT and best match mouse Cdx binding sites (NTTTATDRBHB; [58]). Figure S16. DNA sequence upstream of start codon of B. floridae Cyp26a-3 gene. Red ATG = start codon; blue text = 5' untranslated region; blue highlight = putative Cdx binding site; underlined blue highlight = binding sites chosen for mutagenesis; yellow highlight = cloning primers.
The 3.1 kb region was cloned into pGL3 basic vector (Promega) between SacI and NheI sites, and the resultant construct ('Cyp26-3 promoter') was used to make mutant constructs with one or two putative Cdx binding sites mutated ('BS1 mutation', 'BS2 mutation' and 'BS1,2 mutation') using a PCR method (Table S3). Controls were empty pGL3 vector and not injected samples.  Table S3: Primers used for cloning and mutagenesis of Cyp26-3 promoter sequence Injection solutions were prepared containing 3 ng/μL Renilla luciferase vector pRL-TK (Promega), 20% glycerol, 5 mg/ml Texas Red dextran, with or without 30ng/μL each of above luciferase constructs. Microinjection into unfertilized amphioxus eggs was conducted as previously described [87]. For each experiment, ~60 embryos were collected at 16 hours post fertilization. Wild type embryos from the same batch were also collected and used as a negative control ('WT'). Levels of luciferase and Renilla were detected with the Dual Luciferase Kit (Promega Co.) using a GloMax luminometer with an integration of 10 seconds. The level of luciferase activity was normalized to the level of Renilla activity for each experiment. All experiments were repeated three times (Table S4) Table S4: Raw values of relative luciferase for each construct and three replicate experiments RFV1 to RFV3. Values compared graphically in Figure 5B.

SECTION 7: DIFFERENTIAL EXPRESSION OF CANDIDATE TARGET GENES
Raw Illumina sequence data are deposited under NCBI BioProject PRJNA594548 [96], including 12 BioSamples (SAMN13521182 to SAMN13521193) and 12 SRA datasets (SRR106745850 to SRR10674591). Expression data inferred by read mapping, and DNA sequences of superTranscripts referred to below, are given in Additional file 2: Supplementary Data (Tabs 2 to 7).

(a) Principal Component Analysis
We used Principal Component Analysis (PCA) to test for outliers and batch effects in the RNAseq replicates before testing for differential gene expression between mutant and control sets. PCA used the plotPCA function of the DESeq2 v3.8 R package [95]. This uses the counts obtained from the featureCounts [94] function of the Subread R package and stores it in the dds object; plotPCA uses data from the rlog transformation of dds.
PCA applied to transcriptomes from the Cdx experiment separated samples according to batch or developmental age along Principal Component 1 (Figure S17 A). Specifically, batch 1 samples (collected at 34 h post-fertilization) were separated from batch 2 samples (collected at 42 h postfertilization) along PC1. In contrast, Principal Component 2 separated samples according to genotype (mutant vs. wild type); Figure S17 A.
Distance matrix analysis also separated batch 1 and batch 2 samples (Figure S17 B).
We therefore conducted two Differential Gene Expression analyses, focussing on Batch 2 only (42 h embryos; results in Additional file 2, the results below, and results discussed in main text) or combining all samples (34h plus 42h embryos; results in Additional file 2, plus noted in Cyp26-3 analysis below). PCA applied to transcriptomes from the Pdx experiment separated samples according to genotype along Principal Component 1 (mutant vs. wild type). One wild type sample (BfPdx.2 WT) grouped aberrantly, especially along Principal Component 2, and was excluded from Differential Gene Expression analysis as a putative outlier ( Figure S18).
Figure S18: Principal Component Analysis (PCA) Biplot of Pdx mutant and wild type RNAseq data sets. Principal Component 1 primarily separates samples according to genotype (wild type vs. mutant);wild type sample 2 (BfPdx.2 WT) is a putative outlier. .

(b) Analysis of GFP genes affected by mutation of amphioxus Pdx
In Pdx 4Δ/11Δ mutants, we found down-regulation of reads mapping to 11 contigs from the GFP gene family (GE_G14886, FE_G16436, NOVEL_103923, NOVEL_102722, NOVEL_50813, ML_G28518,  FE_G16645, FE_G16414,   The most likely identities for these contigs are GFP-8, GFP-10 and/or GFP-13. The single match to GFP-1 is weaker and derived from a short contig so is less reliable. It should not be concluded that all three best hit genes (GFP-8, GFP-10 and GFP-13) are changing in expression level, however, because four amphioxus genes -GFP-8, GFP-10, GFP-12 and GFP-13 -have highly similar nucleotide sequences and are the product of recent tandem gene duplication [63]. Short read sequence data, as generated in this study, cannot be unambiguously assigned to a particular gene and reads will be split by multimapping between them. We conclude that one, or more, of the genes GFP-8, GFP-10, GFP-12 and GFP-13 has been down-regulated in expression following Pdx mutation.

(c) Analysis of insulin signalling pathway genes affected by mutation of amphioxus Pdx
i. Signalling peptides. We found three contigs with partial sequence similarity to insulin-like peptide (ILP) genes (FE_G15481, GE_G16157 and FE_G15511); these are expressed ~2 to 11 fpkm. NCBI blastn analysis vs nr database (31/7/19) suggests the first two represent genes related to ILP; the third (FE_G15511) is true ILP. None of these contigs showed significant expression level changes in Pdx mutants (Additional file 2).
ii. Binding proteins. SuperTranscript M_G27744 encompassed three related transcripts (Table  S6), each with stretches of 100% identity to a gene annotated as insulin-like growth factor binding protein 7 (IGFBP7) in B. belcheri (XM_019790737, LOC109486832) and its homologue in B. floridae (XM_002608818.1). These contigs showed clear down-regulation in Pdx mutants. A second superTranscript FE_G33063 also matched this gene and showed similar down-regulation (Table S6).

(d) Analysis of iLBP, RAR, RXR, Cyp26, Rootletin and T-box gene expression in amphioxus Cdx mutants
In Cdx -/mutants (42h analysis), we found 1.6-to 13.5-fold down-regulation of reads mapping to eight contigs from the intracellular lipid-binding protein (iLBP) gene superfamily which in vertebrates include CRABP (Cellular retinoic acid-binding protein), CRBP (cellular retinol-binding proteins) and FABP (Fatty acid-binding proteins). These contigs are FE_G16051, GE_G22703, ML_G19061, NOVEL_10269, NOVEL_50856, NOVEL_65559, NOVEL_74162, NOVEL_82533. To analyse whether these are variants of the same amphioxus gene, or multiple genes, BLASTX was used vs NCBI nr (30-31/7/2019). This revealed that seven contigs represent the gene iLBP4 and one represents iLBP6. Rootletin gene expression is clearly down-regulated in the transcriptome of Cdx -/mutants, indicating that once down-regulated in mutant embryos it does not recover at later developmental stages. We found 7 contigs for the same gene, matching different isoforms or incomplete transcripts (ML_G1957, ML_G8904, ML_G8919, NOVEL_38192, GE_G1583, GE_G1727, NOVEL_60824). All were significantly down-regulated in Cdx -/mutants. The longest contigs are summarized in Table S7.
Contigs for most T-box genes showed no significant change in expression level, apart from Brachyury-2 (1.46 fold up-regulation, from 6.8 mean fpkm in wild type to 9.9 mean fpkm in Cdx mutants; Table S7).  First, we note that absolute levels of expression differ greatly between Hox genes in wild-type embryos, as estimated from transcriptome read mapping (calculated here from the Cdx experiment 42h samples). Only Hox1, Hox3, Hox4 and Hox6 have expression levels above 10 fpkm; Hox2, Hox5 and Hox7 have fewer reads counts and the more 'posterior' paralogy group genes barely any (Table  S8; Figure S19).   Figure S20). Second, we detect a colinear-like response of Hox genes to mutation of Cdx, with paralogy group 1 gene expression higher in mutants, Hox2, Hox3 and Hox4 unaffected, Hox5 and Hox6 mildly downregulated, and Hox7 strongly down-regulated (Table S8). Only the Hox1 and Hox6 expression changes are significant when each gene is considered one at a time (Table S8); considering genes as a cluster and plotting mean changes collectively reveals a significant negative slope to the response ( Figure S21). B. floridae Hox1 to Hox7, log2 fold change in expression following Cdx mutation

SECTION 8: DIFFERENTIAL EXPRESSION OF GUT-ENRICHED GENES
Gut-enriched genes were identified from published B. lanceolatum RNA-seq data (NCBI GEO GSE106430;[69]). We defined a gene as gut-enriched if it had a higher mean expression level in gut than in any other adult tissue (eggs and embryos excluded) and if expression level in gut was at least twice the expression level in seven out of eight other tissues (neural tube, muscle, gill bars, hepatic diverticulum, testis, ovary, skin, cirri; Additional file 2: Supplementary Data, Tab 8). This gave 2083 B. lanceolatum gut-enriched genes (listed in Additional file 2: Supplementary Data with expression levels; Tab 8 and Tab 9). These were matched to contigs in the current study using blastn with an evalue cut off of 1 e-70 giving 4705 gut-enriched B. floridae superTranscripts, also listed in Additional file 2: Supplementary Data, Tab 9.
We asked if the genes affected by Pdx or Cdx mutation were enriched in gut-enriched genes. There are 5831 superTranscripts that are expressed differentially between wild type and Pdx mutant larvae; of these 482 are also found in the set of 4705 gut-enriched contigs. Similarly, there are 1428 superTranscripts expressed differentially between wild type and Cdx mutant larvae; of these 218 are found in the set of 4705 gut-enriched contigs. This equates to 8.3% of the Pdx differentially expressed contigs and 15.3% of the Cdx differentially expressed contigs.
To test if 8.3% and 15.3% represents enrichment, we ran 1000 simulations of each sampling (5831 or 1428 sequences chosen randomly from the superTranscriptome) and assessed overlap with the gutenriched dataset. Mean overlaps were 3.3%, with the experimental data being a highly significant outlier in each case (arrows in Figure S22). Hence, mutation of Pdx or Cdx has a disproportionate and significant effect on expression of gut-enriched genes. Figure S22: Mutation of amphioxus Pdx or Cdx has a disproportionate effect on gut-enriched genes. The degree of overlap (orange and green arrows) is outside the range of overlap generated by random sampling (orange and green curves).

Pdx Cdx
We also searched the 'gut-enriched, differentially expressed' data sets for genes implicated in gut cell function. To do this, we first reduced the data sets to well-annotated protein-coding genes by (a) clustering highly similar sequences (>90%) by cd-hit-est, and (b) excluding genes without proteincoding annotations in the IncDNA-BF database [92,93]. This reduced the Pdx 'gut-enriched, differentially expressed' from 482 contigs to 218 genes, and the Cdx 'gut-enriched, differentially expressed' from 218 contigs to 92 genes. Putative identities are given in Tab 10 and Tab 11 of Additional file 2: Supplementary Data. Examples from the Pdx data set are given in Table S9.