Complete representation of a tapeworm genome reveals chromosomes capped by centromeres, necessitating a dual role in segregation and protection

Background Chromosome-level assemblies are indispensable for accurate gene prediction, synteny assessment, and understanding higher-order genome architecture. Reference and draft genomes of key helminth species have been published, but little is yet known about the biology of their chromosomes. Here, we present the complete genome of the tapeworm Hymenolepis microstoma, providing a reference quality, end-to-end assembly that represents the first fully assembled genome of a spiralian/lophotrochozoan, revealing new insights into chromosome evolution. Results Long-read sequencing and optical mapping data were added to previous short-read data enabling complete re-assembly into six chromosomes, consistent with karyology. Small genome size (169 Mb) and lack of haploid variation (1 SNP/3.2 Mb) contributed to exceptionally high contiguity with only 85 gaps remaining in regions of low complexity sequence. Resolution of repeat regions reveals novel gene expansions, micro-exon genes, and spliced leader trans-splicing, and illuminates the landscape of transposable elements, explaining observed length differences in sister chromatids. Syntenic comparison with other parasitic flatworms shows conserved ancestral linkage groups indicating that the H. microstoma karyotype evolved through fusion events. Strikingly, the assembly reveals that the chromosomes terminate in centromeric arrays, indicating that these motifs play a role not only in segregation, but also in protecting the linear integrity and full lengths of chromosomes. Conclusions Despite strong conservation of canonical telomeres, our results show that they can be substituted by more complex, species-specific sequences, as represented by centromeres. The assembly provides a robust platform for investigations that require complete genome representation.

ParaSite (WBP) [11] in 2015 (details of the v2 assembly are described in [8]). Here we present 64 the third major release of the genome; a reference quality update to the assembly that was 65 made available to the public with the 12 th release of WBP (December 2018). The genome has 66 been assembled into full chromosomes, based on the addition of long-read sequence data to 67 previous short-read data followed by extensive alignment, manual review and re-assembly 68 guided by optical mapping data. With this release, H. microstoma represents the most 69 completely assembled genome of the lophotrochozoan superphylum. 70 assemblies of other parasitic flatworms: 62% with the hydatid tapeworm Echinococcus 96 multilocularis (v4) and 47% with the human blood fluke Schistosoma mansoni (v7) ( Table 1,  97   Table S2). Compared with the v1 and v2 assemblies, this amounts to 8% and 6% more one-to-98 one orthologues with E. multilocularis and 12% and 6% more with S. mansoni, respectively. 99 Overall, the number of genes and average intron and exon size of the v3 proteome is most 100 consistent with the v1 release, whereas the v2 annotation contained an inflated gene count. This 101 indicates that the gene model estimates have stabilized, and together with the assembly and 102 proteome completeness metrics, reflects the advanced level to which the annotation of coding 103 regions has been completed for this genome. A full list of H. microstoma gene models and 104 annotations together with E. multilocularis orthologues is given in Table S3. 105 Consistent with the expansion of previously under-represented repeat arrays discussed 106 below, we find that 99 genes previously present as single copies now exist as families with at 107 least three paralogues (Fig. S2, Table S4). Amongst the 12 families with the largest expansions 108 (≥ 5-fold) compared with the v1 genome, a notable example is a C2H2-type zinc finger gene that 109 now has ten copies where previously there was just one. Three families (encompassing 16 110 genes in v3 but only 3 in v1) are similar to major vault proteins -a cytoplasmic ribonuclear 111 protein complex -and seven families have no obvious sequenced-homologs in other organisms 112 and potentially represent proteins with novel biological functions. 113 Using the Benchmarking Universal Single-Copy Orthologs (BUSCO) approach [19], 77% 114 of expected genes were identified as complete and without duplication (Table S5). This 115 compares favourably with the manually finished reference genomes of E. multilocularis (70%) 116 and S. mansoni (73%); completeness scores for parasitic flatworms always fall considerably 117 short of the 100% benchmark. It is therefore likely that many suggested 'core' metazoan genes 118 have been lost or have significantly diverged in the flatworm lineage, rather than being 119 erroneously absent from these assemblies. For example, of the 178 BUSCO core genes missing 120 from the v3 assembly, 160 are also missing from E. multilocularis and 135 from S. mansoni 121 (Table S6). Another factor is likely to be that the lophotrochozoan superphylum is represented by 122 only three species in the BUSCO metazoan database (v3.0.2: two molluscs and one annelid 123 worm). Such under-representation of one of three superphyla may be biasing the circumscription 124 of 'core' genes in the Metazoa. 125 Previously generated RNA-seq data representing different life cycle stages and regions 126 of the adult, strobilar worm were re-mapped to the new v3 assembly and proteome and the 127 resulting and tight clustering among sample replicates based on principal component analyses (Fig. S4A). 132 Heat map analyses (Fig. S4B) indicate that the transcriptome of the scolex-neck region of the 133 adult is more similar to that of the metamorphosing larvae than to the mid or end reproductive 134 regions of the adult, and this was also shown to be supported by subsets of genes representing 135 signalling pathways and transcription factors as discussed in [ Although the addition of long read data in the present assembly enabled full resolution of 152 many more repeat arrays than in previous versions, the depth of coverage of reads realigned to 153 the genome assembly is inordinately high in many places ( Fig. 1) indicating that for some 154 repeats, multiple sequenced copies are aligning to fewer copies in the assembled consensus. 155 The true size of some of the largest repeat arrays therefore remains under-represented, 156 including the ribosomal RNA, telomeric and centromeric arrays. Two of the largest examples are 157 on Chr1 (38.9-40.7 Mb) and Chr3 (0.75-4.2 Mb) that are currently assembled into sequences 158 less than half of their expected size based on the relative depth of coverage (labelled A and B, 159 respectively, on Fig. 1). In contrast, Chr4 is notable in having a low proportion of repeats; only 160 14% of the chromosome is classified as repeat compared with 21-28% across the other 161 chromosomes. The ribosomal RNA array located on Chr2 stands out as the most prominent 162 single repeat type, with an assembled length of 767 kb (0.45% of the assembly). However, its 163 true size based on depth of sequence coverage is likely to be closer to 7.5 Mb (4.4% of the 164 genome), further discussed below. 165 Repeat content in the first published tapeworm genomes was reported at 7-11%, of 166 which only 2% was attributed to TEs [6]. This proportion of repeats and TEs is exceptionally low 167 and was most likely a reflection of both the inability to fully resolve repetitive regions using short-168 read data and differences in the identification of TEs. Although TE content is highly variable both 169 across and within animal taxa [21], estimates here of ~25% of the genome content is more 170 typical of metazoans in general and closer to that reported for S. mansoni (~35%) [1]. 171

Variable repeat regions explain length discrepancies in sister chromatids 172
It was noted from karyology that sister chromatids are not equal in length [14] and that this was 173 especially visible in the largest pair [15]. Although these studies could not rule out the possibility 174 that such differences resulted from the squash technique employed, our sequence data 175 corroborate their observations; whereas we see little to no sequence variation in our assembled 176 contigs, optical mapping data suggest that the largest tandem repeats, which remain elusive to 177 full resolution, could have differing lengths in each pair of sister chromatids. For example, while 178 an optical contig spans the rRNA repeat on Chr2 (the second largest chromosome), giving a 179 short 200 kb form with 17 copies, another optical contig extends into but not across the array, 180 and likely represents the longer version of a larger, alternative haplotype (Fig. S5). It is not 181 possible to directly measure the length of this latter copy but using mapped coverage of Illumina 182 reads from a single library, Chr2 has a median coverage depth of 96x, yet there is a median 183 coverage of 754x over the 486 kb region containing the repeat. We therefore extrapolate that the 184 repeat region exists in the sister chromatid as sequence close to 7.5 Mb. Thus sister chromatids 185 from Chr2 could vary in length by ~25% due to dimorphism in this one repeat region alone. 186 Several other less extreme cases of optical contigs giving two different lengths for the same 187 locus are apparent in the whole genome optical map (Fig. S6), and there are other large repeat 188 regions whose full size is not currently known that could contribute further to homologous 189 chromosomes having unequal lengths. 190

Micro-exon genes are identified in the v3 assembly 191
Genes containing micro-exons that code for as little as a single amino acid occur throughout 192 biology [22]. However, the term micro-exon gene (MEG) was coined for a class of gene that was 193 first identified in the genome of S. mansoni [1] and subsequently in E. multilocularis  (Table S10). Ten of the MEGs with 14 transcripts are found in a single 201 region of Chr6 (2,643,059-3,072,453) and all share a conserved amino acid sequence motif 202 (consensus: MRLFILLCFAVTLWACPKQCP) that indicates that they belong to a single gene 203 family that expanded via tandem duplication (Fig. S7). A concerted effort to identify and curate 204 MEGs across several flatworm lineages is a high priority for trying to find clues to the functional 205 roles of this numerous yet poorly understood class of genes. However, as many MEGs contain 206 repetitive sequences they are a challenge to analyse without extensive manual curation and at 207 present orthogroups can not be determined with confidence. 208

RNA-seq data demonstrate evidence of spliced leader trans-splicing 209
Spliced-leader (SL) trans-splicing is an mRNA maturation process in which a 5' donor sequence 210 encoded by its own locus (i.e. the splice leader gene) is spliced to the 5' exons of other gene 211 transcripts and was first identified in tapeworms by Brehm et al. [24]. We identified the presence 212 of SL trans-spliced transcripts in the transcriptomes of adult and larval H. microstoma for the first 213 time. We hypothesised that leader sequences would be present in total RNA-seq libraries and 214 identifiable by their abundance in soft-clipped read segments after alignment to the genome. 215 Using this approach we successfully recovered the previously identified E. multilocularis and S. 216 mansoni SL sequences [24,25] from analyses of publicly available RNA-seq libraries (Fig. S8A). 217 Our method identified 3,876 genes as being putatively trans-spliced in S. mansoni on the basis 218 of having at least one SL-associated read across all of the libraries analysed, reducing this to a 219 conservative set of 1,219 genes with at least ten SL-associated reads. This is comparable with 220 previous estimates of trans-splicing in S. mansoni based solely on total RNA-seq libraries [25]. 221 For E. multilocularis, 1,609 genes were identified with ≥ 1 SL-associated read and 527 with ≥ 10 222 reads. 223 Clustering soft clipped read segments from H. microstoma resulted in three abundant 224 clusters, referred to as SL1, SL2 and SL3 (Fig. S8A). Screening these 23-27 bp putative SL 225 sequences against the genome showed that the SL1 motif is found in each of the two exons that 226 comprise gene model HmN_002290900 (Chr1), SL2 is found in an intronic region associated 227 with gene model HmN_000738800 (Chr3), and SL3 is found in a single exon associated with 228 gene model HmN_000738800 (Chr1). No other region in the genome contained these 229 sequences. Based on these SL sequences we identified 1,341 genes with ≥ 1 read and 496 230 genes with ≥ 10 reads as being putatively trans-spliced. Of the latter, 449 were associated with 231 all three SL sequences, having at least one read of each SL aligned. Similarly, the total number 232 of trans-spliced transcripts found for each SL was highly similar (SL1 = 18,831, SL2 = 18,725, 233 SL3 = 19,241) and the use of 'interchangeable' alternative SL forms was also reported for E.  Table S11. Notably, we found that libraries derived from larval H. mansoni as an outgroup, we can infer that the three tapeworm breaks in synteny are fusions (H1 268 cf. E1+8, H5 cf. E5+7, and H6 cf. E6+9) as the synteny blocks that have fused to make these H. 269 microstoma chromosomes exist separately in the blood fluke (Table S12). In addition to three 270 fusion events, synteny evidence allows us to unambiguously order and orientate two scaffolds 271 from the E. multilocularis assembly to form a single chromosome, corresponding to a single 272 ancestral linkage group (labelled E9 in Fig. 1B and G in Fig. 3C). By doing so, the E. and Zea mays, 156 bp.); its sequence is species-specific and highly conserved across 296 chromosomes [36] (with the exception of Chr2 discussed below); and there is only one, large 297 repeat array per chromosome. Moreover, among the sequences that contain this repeat we only 298 find a single junction from unique sequence into the repeat and no junction out of it into another 299 sequence as we find in all other repeats in the genome, and hence it represents a terminal 300 sequence. Finally, we note that in each chromosome the orientation of the repeat remains 301 constant relative to the telomere. That is, by aligning the chromosomes by their telomeric ends 302 (requiring reverse complimenting of Chr1 and Chr2; see Fig. 1) the centromeric sequences are 303 also in alignment. Using the first published assembly [6] and purely algorithmic means (i.e. high 304 copy number, large tandem repeats), this same motif was independently predicted to be the 305 centromere by Melters et al. [37]. We estimate the total size of each repeat array to be at least 306 5.5 Mb. 307 Whereas five of the chromosomes have identical motifs, Chr2 contains not only the same 308 novel centromere motif but also a second dominant motif (Fig. S9). In addition, the array is larger 309 and interspersed with other repetitive elements (e.g. gag pol polyprotein) and has a larger sub-310 telomeric region (Fig. S10). To corroborate our results we used chromosomal fluorescent in situ 311 hybridisation (FISH) with probes against the canonical telomeric sequence, showing that only 312 one telomere array is present on each chromosome (Fig. 4A) and that it is opposite to the joined 313 ends of sister chromatids (Fig. 4B), as predicted by our assembly. 314

315
Such a highly resolved assembly is still unusual and is a product of not only long-read sequence 316 data and optical mapping but also a process of manual improvement. Using Gap5 [38], we were 317 able to scrutinise sequence assemblies from the level of individual base pairs up to whole 318 chromosomes, facilitating diagnosis and resolution of mis-assemblies as well as enabling further 319 scaffolding from clues contained in the read coverage and read-linking data. In this way we 320 have, unusually, been able to place all of the generated sequence data into a chromosomal 321 location, leaving an assembly that is resolved into the same number of scaffolds as the 322 karyotype, with a combined coverage of over 300x. Moreover, although 85 gaps remain there is 323 strong evidence that no novel, complex sequence is missing from the assembly. Assembly was 324 further aided by exceedingly low levels of haploid variation, with only 52 SNPs present in the 325 entire genome. Such low intraspecific genetic variation is very unusual and is presumed to be 326 the result of sequencing a highly inbred laboratory strain [13]. to be found in the rapid evolution of the sub-telomeric and peri-centrosomal repeats that 357 accompany these arrays [36,42] and it is becoming increasingly clear that despite their functions 358 being perfectly conserved, centromeric and telomeric regions undergo highly dynamic evolution 359 driven by TEs [47]. 360

361
Third generation sequencing technologies have enabled the production of highly contiguous 362 genome assemblies that provide more accurate estimates of content as well as the ability to 363 investigate syntenic relationships and other higher-order features of genome architecture. With 364 the third release of the Hymenolepis microstoma genome we have produced a reference quality, 365 end-to-end assembly that provides complete chromosomal representation. The hybrid assembly 366 has stabilised estimates of the proteome and non-coding regions and represents a resource 367 effectively free from sampling error. The release thus provides a robust platform to begin 368 systems-level analyses in parasitic flatworms and to this end has been recently used to infer 369 protein-protein interactions based on functional data gathered from major model systems [48]. Genomic DNA for optical mapping was extracted from agarose-embedded specimens 406 using the CHEF Genomic Plug DNA kit (BioRad) in order to minimise fragmentation. Four 407 samples were prepared, using 500 and 1,000 larvae (i.e. fully patent cysticercoids harvested 408 from beetles), and 3 (6.6 mg damp weight) and 7 (10.9 mg) sections of adult worm (anterior ~2 409 cm each; as above). 2% CleanCut (BioRad) agarose was melted at 70°C then cooled to 50°C. process as input. These assemblies were then passed to Metassembler for merging, using the 444 HGAP4 assembly as the primary assembly and the Canu assembly as the secondary 445 assembly. The resulting sequence assembly was passed to Bionano's Hybrid (optical map) 446 Scaffolder. In addition, an Illumina-only SpAdes assembly was produced [52]. 447

Manual genome improvement 448
The genome was manually improved by examining the optical map data in Bionano's Access 449 software and the sequence data in Gap5 [38]. Errors in the assembly were identified where 450 scaffold breaks needed to be made, or places where new joins could be made. Where groups of 451 Illumina reads mapped to contig ends without their mate-pair, the SpAdes assembly was queried 452 to recover data missing from the assembly. All assembly edits resulting from such investigations 453 were made in Gap5. Soft-clipped reads (PacBio and Illumina) at contig ends were also unclipped 454 where they were found to be in agreement with each other. Many rounds of extending soft-455 clipped data, remapping, and checking, followed by further extension were undertaken and the 456 results of these incremental improvements were fed back to the Hybrid Scaffolder. there were four junctions from non-repetitive sequence into these repeats. In this instance, a 471 scaffold path was chosen that followed synteny with E. multilocularis and S. mansoni, given that 472 only three real synteny breaks were found elsewhere. 473 Extensive optical alignment was used to confirm assembly accuracy (Fig. S6). Apart from 474 three large repeat regions (A, B and rRNA repeat), effectively the entire genome had very good 475 alignment with optical contigs. Some additional gaps remained in the alignments due to large 476 repeats. Optical contigs were much shorter than sequence scaffolds due to a known issue 477 whereby nick sites that occur close together on opposite strands introduce systematic double-478 stranded breaks that limit the contiguity of Bionano optical maps [54]. 479 This assembly approach yielded the nuclear plus mitochondrial genomes with n = 7 and 480 with 85 sequence gaps remaining, most likely containing repetitive sequence. The mitochondrial 481 contig was circularised to Cox1 (Fig. S1). 482

Gene finding and annotation 483
Given the fragmented nature of the v1 assembly and questions around the veracity of the v2 484 annotation set that had 2,000 additional gene models compared with either the v1 gene models 485 or those for E. multilocularis, we opted to generate a de novo annotation with Braker2 [16] using 486 RNA-seq data as input (for raw data accessions see S1. and it was noted that many of these models had fallen near to, but just below, the 97.5% 493 threshold mentioned above, and upon inspection were generally found to result from incorrect 494 annotation of gene models in tandem repeats and so were removed. OrthoMCL [58] was used to 495 find one-to-one gene mappings between the resulting annotation and the previous v1 and v2 496 gene models. Where unambiguous mappings were found, the historical gene IDs were 497 transferred and are thus consistent with previous releases. Where mappings were ambiguous or 498 non-existent, new gene IDs were created prefixed with '003' (e.g . HmN_003NNNNNN). The 499 mitochondrial genome was annotated independently using Mitos2 [59]. 500 The distribution of repeats were subsequently analysed using RepeatModeller (v1.0.11) followed 501 by RepeatMasker (v4.0.7). 502

Analysis of synteny conservation between flatworms 503
The S. mansoni genome assembly v7 (PRJEA36577) and the latest E. multilocularis assembly 504 were obtained from WBP (release 12). Translated alignments of 100 kb windows from each H. From this, we calculated a grand total of 5.5 Mb which we take to be a minimum size estimate 520 for this repeat, in line with the expectation that the centromere repeat is likely to be the largest 521 repeat in the genome [37]. 522

Identification of micro-exon genes 536
Custom shell and Perl scripts were used to download and parse GFF-formatted annotation from 537 WBP (July 2019) to create a table of exon lengths for each gene. The resulting table was further 538 parsed to identify exons shorter than 70 nucleotides and divisible by three as micro-exons. 539 Genes comprising at least seven exons, with micro-exons constituting at least half of all exons 540 and runs of at least four consecutive micro-exons were deemed to be micro-exon genes. For 541 more information see https://github.com/wbazant/microexons/blob/master/README.md. 542

Identification of splice leader sequences and trans-spliced genes 543
Publicly available RNA-seq libraries were used to identify splice leader sequences in E. Office to PDO. 715

Consent for publication 716
Not applicable 717

Availability of data and material 718
The datasets generated and/or analysed during the current study are available in the European 719 Nucleotide Archive (www.ebi.ac.uk/ena) under the following accessions; genome assembly 720 GCA_000469805.3, long read sequence data study accession PRJEB2107. 721 situ hybridisation; KJ conducted differential expression analyses; FHR analysed spliced leader 730 trans-splicing; SRD performed preliminary analyses of synteny and advised on annotation and 731 analytical approaches; NEH coordinated the specimens and sequencing; AT, AB, PDO and MB 732 interpreted results and prepared the paper which was led by PDO and AT. All authors read and 733 approved the final manuscript. 734 Tables  Table 1 Assembly metrics among Hymenolepis microstoma genome releases 897