Experimental design
This study was undertaken to generate a high-quality reference genome assembly and annotation for the cat flea, C. felis, and represents the first sequenced genome for a member of the order Siphonaptera. Our approach leveraged a combination of long-read PacBio sequencing, short-read Illumina sequencing, and Hi-C (Chicago and HiRise) data to construct a chromosome-level assembly; RNA-Seq data and BLAST2GO classifications to assist in gene model prediction and annotation; sequence mapping to address assembly fragmentation and short scaffolds (< 1 Mb); and ortholog group construction to explore a genetic basis for the cat flea’s parasitic lifestyle. Gene duplications were confirmed via orthogonal approaches, including genome size estimates of individual fleas, gene-based read coverage calculations, genomic distance between duplications, and correlation between duplications and repeat elements or contig boundaries.
Genome sequencing and assembly
Newly emerged (August 2017), unfed female C. felis (n = 250) from Elward Laboratories (EL; Soquel, CA) were surface-sterilized for 5 min in 10% NaClO followed by 5 min in 70% C2H5OH and 3× washes with sterile phosphate-buffered saline. Fleas were flash-frozen in liquid N2 and ground to powder with sterile mortar and pestle. High-molecular weight DNA was extracted using the MagAttract HMW DNA Kit (QIAgen; Venlo, Netherlands), quantified using a Qubit 3.0 fluorimeter (Thermo Fisher Scientific, Waltham, MA), and assessed for quality on a 1.5% agarose gel. DNA (50 μg) was submitted to the Institute for Genome Sciences (University of Maryland) for size selection and preparation of sequencing libraries. Libraries were sequenced on 12 SMRT cells of a PacBio Sequel (Pacific Biosciences; Menlo Park, CA), generating 7,239,750 reads (46.7 Gb total). Raw reads were corrected, trimmed, and assembled into 16,622 contigs with Canu v1.5 in “pacbio-raw” mode, using an estimated genome size of 465 Mb [31]. The second group of newly emerged (January 2016), unfed female EL fleas (n = 100) was surface-sterilized and homogenized as above, and genomic DNA extracted using the QIAgen DNeasy® Blood and Tissue Kit (QIAgen, Hilden, Germany). DNA was submitted to the WVU Genomics Core for the preparation of a paired-end 250-bp sequencing library with an average insert size of 500 bp. The library was sequenced on 4 lanes of an Illumina HiSeq 1500 (Illumina, Inc., San Diego, CA), generating 450,132,548 reads which were subsequently trimmed to remove adapters and filtered for length and quality using FASTX-Toolkit v0.0.14 (available from http://hannonlab.cshl.edu/fastx_toolkit/). These short-read data were used to polish the Canu assembly with Pilon v1.1.6 in “fix-all” mode [63] and to determine the composition of the C. felis microbiome (see below). Haplotigs in the polished contigs were resolved using purge_haplotigs [32] with coverage settings of 5 (low), 65 (mid), and 180 (high). A third group of newly emerged (Feburary 2018), unfed female EL fleas (n = 200) were surface-sterilized as above, frozen at − 80 °C, and submitted for Chicago and Dovetail Hi-C proximity ligation (Dovetail Genomics, Santa Cruz, CA) [64] using the polished Canu assembly as a reference. The resulting scaffolded assembly (3926 scaffolds) was subjected to the removal of microbial sequences as described in the next section.
Genome decontamination
A comparative BLAST-based pipeline slightly modified from our prior work [65] was used to identify and remove microbial scaffolds before annotation. Briefly, polished contigs were queried using BLASTP v2.2.31 against two custom databases derived from the nr database at NCBI (accessed July 2018): (1) all eukaryotic sequences (eukDB) and (2) combined archaeal, bacterial, and viral sequences (abvDB). For each query, the top five unique subject matches (by bitscore) in each database were pooled and scored according to a comparative sequence similarity measure, Sm:
where b is the bitscore of the match, I is the percent identity, and Q is the percent aligned based on the longer of the two sequences. The top 5 scoring matches from the pooled lists of subjects were used to calculate a comparative rank score C for each individual query q against each database d:
$$ C\left(q,d\right)=\frac{2\left({\sum}_n^{i=1}\left(n-{r}_i\left(q,d\right)\right)+1\right)}{n\left(n+1\right)} $$
where ri(q,d) is the rank of subject i for query q against database d. For example, if all of the top n matches for query q are in eukDB then C(q,eukDB) = 1; conversely, if none of the top n matches is in database abvDB then C(q,abvDB) = 0. Finally, each query q was scored according to a comparative pairwise score P between 1 purely eukaryotic) and − 1 (purely microbial):
$$ P=C\left(q,\mathrm{eukDB}\right)-C\left(q,\mathrm{abvDB}\right) $$
Scaffolds that contained no contigs with P > 0.3 (n = 183), including 5 Wolbachia-like scaffolds, were classified “not eukaryotic” and set aside. Scaffolds that contained contigs with a range of P scores (n = 32) were manually inspected to identify and remove scaffolds arising from misassembly or contamination (n = 10). The remaining scaffolds (n = 3733) comprised the initial draft assembly for C. felis and were deposited in NCBI under the accession ID GCF_003426905.1.
Genome annotation
Assembled and decontaminated scaffolds were annotated with NCBI Eukaryotic Genome Annotation Pipeline (EGAP) v8.1 (https://www.ncbi.nlm.nih.gov/books/NBK143764/). To facilitate gene model prediction, we generated RNA-Seq data from 6 biological replicates of pooled C. felis females (Heska Corporation, Fort Collins, CO). Briefly, total RNA was isolated and submitted to the WVU Genomics Core for the preparation of paired-end, 100-bp sequencing libraries using ScriptSeq Complete Gold Kit for Epidemiology (Illumina, Inc., San Diego, CA). Barcoded libraries were sequenced on 2 lanes of an Illumina HiSeq 1500 in High Throughput mode, yielding approximately 26 million reads per sample (Q > 30). Raw sequencing reads from all 6 samples were deposited in NCBI under the BioProject accession PRJNA484943. In addition to these data, the EGAP pipeline also integrated previously published C. felis expression data from the 1KITE project (accession SRX314844 [2];) and an unrelated EST library (Biosample accession SAMN00161855). The final set of annotations is available as “Ctenocephalides felis Annotation Release 100” at the NCBI.
Genome completeness and deflation
The distribution of scaffold lengths in our assembly, together with the relatively large number of fleas in our sequenced pool, warranted evaluating short scaffolds as possible sources of genomic heterogeneity among individual fleas. To address this possibility, assembly scaffolds shorter than 1 Mb (n = 3724) were mapped to scaffolds larger than 1 Mb (n = 9; the BIG9) with BWA-MEM v0.7.12 [66] using default parameters (Additional file 1: Fig. S1A). Additionally, genome completeness of the full assembly compared to just the BIG9 scaffolds was assessed with Benchmarking Using Single-Copy Orthologs (BUSCO) v3.0.2 [30] in “protein” mode, using the eukaryota_odb9, arthropoda_odb9, and insecta_odb9 data sets (Additional file 1: Fig. S1B). Isoforms were removed before BUSCO analysis by identifying CDSs that derived from the same protein-coding gene and removing all but the longest sequence.
Assessing the extent of gene duplication
Proteins encoded on the BIG9 scaffolds (n = 16,518) were queried against themselves with BLASTP v2.2.31 using default parameters. Pairs of unique sequences that met or exceeded a given amino acid percent identity (%ID) threshold over at least 80% of the query length were binned together. Bins of sequence pairs that shared at least one sequence in common were subsequently merged into clusters. Isoforms were removed after clustering by identifying CDSs in a cluster that derived from the same protein-coding gene and removing all but the longest sequence. This process was used to generate cluster sets at integer %ID thresholds from 90 to 100%. These duplicate protein-encoding genes were then mapped onto each of the BIG9 scaffolds using Circos [67] (Additional file 1: Fig. S1C-K). Cluster diameters were calculated as the number of non-cluster genes that lie between the edges of the cluster (i.e., the two cluster genes that are farthest apart on the scaffold) (Additional file 1: Fig. S1L). Clusters that span multiple scaffolds (mapped across all BIG9 scaffolds in Additional file 1: Fig. S1M) defy an accurate calculation of diameter and were assigned a cluster diameter of − 1. In order to estimate the fraction of our assembly comprising gene duplications, cluster coverages (by %ID threshold) were calculated in three ways. First, the coverage by CDS was estimated by comparing the number of single-copy (protein-encoding) genes to the total number of clusters; the latter number is assumed to represent a theoretical set of minimal “seed” sequences. Second, the coverage by gene length was calculated as the total number of nucleotides encoding the proteins in each cluster (including introns and exons) minus the mean gene length (to account for a hypothetical “ancestor” gene). Finally, the coverage by genome region was estimated by adding i*(n − 1) to each calculation of coverage by gene length, where n is the number of genes in the cluster and i is the mean intergenic length across all BIG9 scaffolds (17,344 nt). In order to assess the possible enrichment of cellular functions among duplicated genes, clusters at the 90% ID level were compared to the remaining BIG9 proteins by Fisher’s exact test (corrected for multiple testing) which is integrated into the FatiGO package of BLAST2GO (see “Functional classification of C. felis proteins” section). GO categories were reduced to their most specific terms whenever possible.
Length variation within gene duplication clusters
Variability in intra-cluster CDS length was assessed in two ways. First, the length of each CDS in a cluster was compared to the longest CDS of the cluster, and the proportion of clusters with any truncation (> 1 AA) was calculated for each integer %ID threshold between 90 and 100% ID. Second, the mean and distribution of length differences (i.e., the extent of truncation) were calculated across all clusters for each integer %ID threshold between 90 and 100% ID.
Analysis of repeat regions
The extent and composition of repeat elements in the C. felis genome were assessed in two ways. First, proteins annotated in the GO category “DNA Integration GO:0015074” (including retrotransposons) were extracted, plotted by genomic coordinate on each BIG9 scaffold, and assessed for co-localization either with gene duplicates (see above) or near the ends of scaffolds (Additional file 1: Fig. S1N). Second, repeat elements were identified on the BIG9 scaffolds with RepeatMasker v4.0.9 (available from http://www.repeatmasker.org/) in “RMBlast” mode (species “holometabola”), using Tandem Repeat Finder v4.0.9 and the Repbase RepeatMasker (October 2018) and Dfam 3.0 databases (Additional file 1: Fig. S1O).
Codon usage and tRNA gene family analysis
Given the relatively large number of tRNA genes in our assembly, and the AT richness of our genome, we were interested in exploring connections between tRNA gene frequencies and codon usage. To this end, tRNA gene abundance on BIG9 scaffolds (n = 4358) was determined by binning genes into families according to their cognate amino acid and calculating the percent of each family compared to the total number of tRNA genes (Additional file 1: Fig. S1P). A similar approach was taken to quantify tRNA gene abundance by anticodon. TA richness of each anticodon was subsequently calculated as the percent of A+T bases in the anticodon corrected for the size of the tRNA family. Codon usage was calculated as the percent of total codons using the coding sequences for genes on the BIG9 scaffolds, with isoforms removed as described previously (Additional file 1: Fig. S1Q).
Functional classification of C. felis proteins
Protein sequences encoded on the BIG9 scaffolds (n = 16,518) were queried with BLASTP v2.2.31 against the nr database of NCBI (accessed July 2018) using a maximum e value threshold of 0.1. The top 20 matches to each C. felis sequence were used to annotate queries with Gene Ontology (GO) categories, Enzyme Classification (EC) codes, and protein domain information using BLAST2GO v1.4.4 [68] under default parameters. A local instance of the GO database (updated February 2019) was used for GO classification, and the online version of InterPro (accessed April 2019) was used for domain discovery, including InterPro, PFAM, SMART, PANTHER, PHOBIUS, and GENE3D domains; PROSITE profiles; SignalP-TM (signal peptide) domains; and TMHMM (transmembrane helix) domains. InterPro data was used to refine GO annotations whenever possible (Additional file 2: Table S1). A subset of C. felis proteins (n = 153) classified as “DNA repair” (GO:0006281) was identified and all child GO terms of these proteins tabulated (Additional file 2: Table S1). Assuming a linear relationship between genome size and the number of repair genes [69], we estimate C. felis has an enriched repertoire closer to that of a 3-Gb genome.
Genome size estimation
Estimations for flea genome size largely followed previously reported approaches [70]. For C. felis individuals, 1/20 of the flea head was combined with two standards: 1/20 of the head of a female (YW) Drosophila melanogaster (1C = 175 Mbp) and 1/20 of the head of a lab strain D. virilis female (1C = 328). The tissues were placed in 1 ml of cold Galbraith buffer and ground to release nuclei in a 2-ml Kontes Dounce, using 15 strokes of the “A” pestle at a rate of three strokes every 2 s. The resulting solution was strained through a 45 μm pore-size filter, stained for 3 h in the dark at 4 °C with 25 μl of propidium iodide, then scored for total red fluorescence using a Beckman-Coulter CytoFLEX flow cytometer. The average channel number of the 2C nuclei of the sample and standards were determined using the CytExpert statistical software. Briefly, the amount of DNA was estimated as the ratio of the average red fluorescence of the sample to the average red fluorescence of the standard multiplied by the amount of DNA (in Mbp) of the standard. The estimates from the two standards were averaged. At least 500 nuclei were counted in each sample and standard peak. The coefficients of variation (CV) for all peaks were < 2.0. Fluorescence activation and gating based on scatter were used to include in each peak only intact red fluorescent nuclei free of associated cytoplasmic or broken nuclear tags. Histograms generated for the largest and smallest determined genome sizes show the minimal change in position for the two standards, demonstrating the significant change in the relative fluorescence (average 2C channel number) between C. felis individuals (Additional file 3: Fig. S2).
Characterizing copy number variation
In order to test the hypothesis that our genome assembly represents an agglomeration of individuals with different levels of gene duplication, we used minimap2 [71] to map our short-read sequence data against the full scaffolded assembly. After extracting the mapped reads with samtools v0.1.19 [72], including primary and alternative mapping loci, a vector of sequence depth (in bases) per position was generated with the genomecov function of bedtools v2.25.0 [73]. Mean depths for all 16,518 protein-coding genes on the BIG9 scaffolds were calculated as total bases covering each gene divided by gene length. Finally, the mean depth across all duplicated genes was compared to the mean depth across all single-copy genes using a Student’s t test.
To evaluate the extent of gene duplication across different C. felis populations, reads from the 1KITE transcriptome sequencing project (NCBI Sequence Read Archive accession SRR921588) were mapped to the 3733 scaffolds from our assembly using HISAT2 v2.1.0 [74] under the --dta and --no_unal options. Mapped reads were sorted with samtools and abundance per gene calculated as transcripts per million reads (TPM) using stringtie v1.3.4d [74]. TPM values were binned and plotted against the number of duplicated (90% aa ID or higher) and single-copy genes in the BIG9 assembly.
Comparative genomics
Protein sequences (n = 1,077,182) for 51 sequenced holometabolan genomes were downloaded directly from NCBI (n = 47) or VectorBase (n = 3) or sequenced here (n = 1). Isoforms were removed before analysis wherever possible, by identifying CDSs that derived from the same protein-coding gene and removing all but the longest CDS. Genome completeness was estimated with BUSCO v3.0.2 in “protein” mode, using the insecta_odb9 data set. Ortholog groups (OGs; n = 50,118) were constructed in three sequential phases: (1) CD-HIT v4.7 [75] in accurate mode (-g 1) was used to cluster sequences at 50% ID; (2) PSI-CD-HIT (accurate mode, local identity, alignment coverage minimum of 0.8) was used to cluster sequences at 25% ID; (3) clusters were merged using clstr_rev.pl (part of the CD-HIT package). Proteins from C. felis that did not cluster into any OG (n = 4282) were queried with BLASTP v2.2.31 against the nr database of NCBI (accessed July 2018). Queries (n = 2170) with a top hit to any Holometabola taxon, at a minimum %ID of 25% and query alignment of 80%, were manually added to the original set of ortholog groups where possible (n = 2142) or set aside where not (n = 28). The remaining queries with at least one match in nr (n = 1318) were grouped by GO category level 4 and manually inspected; these included queries with top hits to Holometabolan taxa that did not meet the minimum %ID or query coverage thresholds. Finally, C. felis proteins with no match in nr (n = 766) were binned by query length. These last two sets (n = 2084) comprise the set of proteins unique to C. felis among all other Holometabola (Additional file 6: Table S3). Congruence between OG clusters and taxonomy was determined by calculating a distance (Euclidean) between each pair of taxa based on the number of shared OGs. The resulting matrix was scaled by classic multidimensional scaling with the cmdscale function of R v3.5.1 [76] and visualized using the ggplot package in R. Finally, pan-genomes were calculated for several key subsets of Holometabola: (1) C. felis alone (Siphonaptera), (2) Antliophora (Siphonaptera and Diptera), (3) Panorpida (Siphoanptera, Diptera, and Coleoptera), (4) all taxa except Hymenoptera, and (5) all Holometabola (Additional file 5: Table S2). In order to account for differences in genome assembly quality and taxon sampling bias, we define the pan-genome here as the set of all OGs that contain at least one protein from at least one taxon in a given order. These intersections were visualized as upset plots using UpSetR v1.3.3 [77]. Intersections of various holometabolous taxa that lack C. felis were computed to gain insight on possible reductive evolution in fleas (Additional file 4: Fig. S3, Additional file 5: Table S2).
Microbiome composition
A composite C. felis microbiome was estimated using Kraken Metagenomics-X v1.0.0 [78], part of the Illumina BaseSpace toolkit. Briefly, 105,256,391 PE250 reads from our short-read data set were mapped against the Mini-Kraken reference set (12-08-2014 version), resulting in 2,390,314 microbial reads (2.27%) that were subsequently assigned to best possible taxonomy (Additional file 7: Table S4).
Assembly of Wolbachia Endosymbiont genomes
Corrected reads from the Canu assembly of C. felis were recruited using BWA-MEM v0.7.12 (default settings) to a set of concatenated closed Wolbachia genome sequences (n = 15) downloaded from NCBI (accessed February 2018). Reads that mapped successfully were extracted with samtools v0.1.19 and assembled separately into seed contigs (n = 22) with Canu v1.5 using default settings. Gene models on these seed contigs were predicted using the Rapid Annotation of Subsystems Technology (RAST) v2.0 server [79], yielding two small subunit (16S) ribosomal genes that were queried with BLASTN against the nr database of NCBI to confirm the presence of two distinct Wolbachia strains. Seed contigs were further analyzed by %GC and top BLASTN matches in the nr database of NCBI and binned into three groups: C. felis mitochondrial (n = 1), C. felis genomic (n = 6), and Wolbachia-like (n = 15) contigs. The Wolbachia-like contigs were subsequently queried with BLASTN against the full C. felis assembly (before decontamination). A single Wolbachia-like contig (tig00000005; wCfeJ) containing one of the two distinct 16S genes was retrieved intact from the full assembly. It was removed from the primary assembly and manually closed by aligning the contig ends with BLASTN. Gaps in the aligned regions were resolved by mapping our short-read data to the contig with BWA-MEM (default settings) and manually inspecting the read pileups. Six additional contigs were also retrieved intact from the full assembly; these were likewise removed and manually stitched together using end-alignment and short-read polishing, resulting in a second closed Wolbachia genome (wCfeT). The remaining Wolbachia-like contigs (n = 8) were found to be fractions of much longer flea-like contigs; these were left in the primary C. felis assembly. Both wCfeJ and wCfeT sequences were submitted to the RAST v2.0 server for gene model prediction and functional annotation.
Phylogenomics of Wolbachia endosymbionts
Protein sequences (n = 66,811) for 53 sequenced Wolbachia genomes plus 5 additional Anaplasmataceae (Neorickettsia helminthoeca str. Oregon, Anaplasma centrale Israel, A. marginale Florida, Ehrlichia chaffeensis Arkansas, and E. ruminantium Gardel) were either downloaded directly from NCBI (n = 30), retrieved as genome sequences from the NCBI Assembly database (n = 13), contributed via personal communication (n = 8; Michael Gerth, Oxford Brookes University), or sequenced here (n = 2) (Additional file 7: Table S4). For genomes lacking functional annotations (n = 15), gene models were predicted using the RAST v2.0 server (n = 12) or GeneMarkS-2 v1.10_1.07 (n = 3 [80];). Ortholog groups (n = 2750) were subsequently constructed using FastOrtho, an in-house version of OrthoMCL [81], using an expect threshold of 0.01, percent identity threshold of 30%, and percent match length threshold of 50% for ortholog inclusion. A subset of single-copy families (n = 47) conserved across at least 52 of the 58 genomes were independently aligned with MUSCLE v3.8.31 [82] using default parameters, and regions of poor alignment were masked with trimal v1.4.rev15 [83] using the “automated1” option. All modified alignments were concatenated into a single data set (10,027 positions) for phylogeny estimation using RAxML v8.2.4 [84], under the gamma model of rate heterogeneity and estimation of the proportion of invariant sites. Branch support was assessed with 1000 pseudo-replications. Final ML optimization likelihood was − 183,020.639712.
Confirmation of the presence of wolbachiae in C. felis
To assess the distribution of wCfeJ and wCfeT in C. felis, individual fleas from the sequenced strain (EL) and a separate colony (Heska) not known to be infected with Wolbachia strains were pooled (n = 5) by sex and colony, surface-sterilized with 70% ethanol, flash-frozen, and ground in liquid N2. Genomic DNA was extracted using the GeneJET Genomic DNA Extraction Kit (Thermo Fisher Scientific, Waltham, MA), eluted twice in 50 μl of PCR-grade H2O, and quantified by spectrophotometry with a Nanodrop 2000 (Thermo Fisher Scientific, Waltham, MA). One hundred nanograms of DNA from each pool was used as a template in separate 25 μl PCR reactions using AmpliTaq Gold 360 (Thermo Fisher Scientific, Waltham, MA) and primer pairs (400 nmoles each) specific for (1) a 76-nt fragment of the cinA gene specific to wCfeJ (Fwd: 5′-AGCAACACCAACATGCGATT-3′; Rev: 5′- GAACCCCAGAGTTGGAAGGG-3′), (2) a 75-nt fragment of the apaG gene specific to wCfeT (Fwd: 5′- GCCGTCACTGGCAGGTAATA-3′; Rev: 5′- GCTGTTCTCCAATAACGCCA-3′), or (3) a 122-nt fragment of Wolbachia 16S rDNA (Fwd: 5′-CGGTGAATACGTTCTCGGGTY-3′; Rev: 5′-CACCCCAGTCACTGATCCC-3′). Primer specificities were confirmed with BLASTN against both the C. felis assembly and the nr database of NCBI (accessed June 2018). Reaction conditions were identical for all primer sets: initial denaturation at 95 °C for 10 min, followed by 40 cycles of 95 °C for 30 s, 60 °C for 30 s, and 72 °C for 30 s, and a final extension at 72 °C for 7 min. Products were run on a 2% agarose gel and visualized with SmartGlow Pre Stain (Accuris Instruments, Edison, NJ). Primers were tested before use by quantitative real-time PCR on a CFX Connect (Bio-Rad Laboratories, Hercules, CA).
Statistical analysis
Statistical analyses were carried out in R v3.5.1. Mean coverages across duplicated (n = 7852) and single-copy (n = 7061) genes at the 90% ID threshold were compared for significance using a Welch two-sample t test (unpaired, two-tailed) with 12,930 degrees of freedom and a p value < 2.2 × 10−16. The mean coverage of duplicated genes at %ID thresholds from 85 to 100% was compared for significance using one-way analysis of variance (ANOVA) with 15 degrees of freedom and a p value = 0.2. A similar ANOVA was used to compare single-copy genes at 85–100% ID thresholds, with a p value < 2.2 × 10−16.
Data and scripts
Data generated for this project that is not published elsewhere, including BLAST2GO annotations and OG assignments, as well as custom analysis scripts, are provided on GitHub in the “cfelis_genome” repository available at https://www.github.com/wvuvectors/cfelis_genome.