M. oryzae cultures and DNA extraction
M. oryzae Guy11 was grown on Difco oatmeal agar plates for 21 days under constant light in a Percival Scientific Incubator Model CU-36L4 equipped with half fluorescent lights and half black lights. Then, 1 cm2 of mycelium was scraped from the colony edge and used to start 3 liquid cultures (biological replicates) in petri dishes with 15 ml complete medium [49]. Liquid cultures were incubated without shaking for 3 days in the same growth chamber.
Total DNA extraction was performed according to a protocol from the Prof. Natalia Requena group at the Karlsruhe Institute of Technology. Briefly, mycelium grown in liquid culture was washed 3 times with water and then ground in liquid nitrogen. Ground mycelium was incubated in extraction buffer (0.1M Tris-HCl pH 7.5, 0.05 M EDTA, 1% SDS, 0.5 M NaCl) at 65°C for 30 min. Then, 5M potassium acetate was added to the samples which were then incubated on ice for 30 min. The supernatant was then washed with isopropanol and ethanol. Finally, the DNA pellet was resuspended in water and treated with RNase A (Thermo Scientific).
O. sativa growth and DNA extraction
O. sativa samples were originally intended to serve as control samples to be compared to tissue infected by M. oryzae, and therefore, the methods below reflect this original intent. However, circularome sequencing data obtained from infected tissue was not included in this study as it included very little sequencing data that mapped to the M. oryzae Guy11 genome.
O. sativa cv. Nipponbare seeds were surface sterilized in 70% ethanol for 1 min and 10% bleach for 10 min with thorough rinsing in sterile deionized water after each before being placed on wet filter paper in a petri dish. The petri dish was wrapped in foil and placed at 4°C for 2 days to germinate. Germinated seedlings were planted in potting mix made up of 50% Turface and 50% Super Soil. Seedlings were grown for 3 weeks in a greenhouse under standard conditions. For three samples, the first true leaf was cut from one rice plant, its tip removed, and then cut into two equal segments, approximately 10mm in length. This pair of segments was then placed on their abaxial surface on wet filter paper in a petri dish. Five hole-punches of filter paper soaked in 0.25% gelatin and 0.05% Tween-20 were then placed on each segment. The petri dishes were placed in an airtight container with wet paper towels and then placed on a windowsill for 7 days. Hole-punches were removed and non-chlorotic tissue in contact with hole-punches was ground in liquid nitrogen. DNA was extracted using the Qiagen Plant DNeasy mini kit.
Circular DNA enrichment
Total DNA obtained from DNA extractions (biological replicates) were then split into three samples (technical replicates) before circular DNA enrichment. This enrichment was performed according to a protocol from Lanciano et al. with a few modifications [6]. Five micrograms of extracted DNA was used as input for circular DNA enrichment in M. oryzae, and 750 ng of extracted DNA was used for O. sativa. To purify the samples and begin removing large linear DNA fragments, the samples were treated using a Zymo Research DNA Clean and Concentrator kit with standard protocols. Linear DNA digestion was then performed using Epicentre PlasmidSafe DNase and incubated at 37°C for 24 h. DNase, ATP, and reaction buffer were then added to the samples every 24 h throughout the duration of the incubation. In total, the reaction was allowed to proceed for 96 h. Remaining DNA was then precipitated overnight at 4°C by adding 0.1 volume 3M sodium acetate, 2.5 volumes ethanol, and 1 μl glycogen (20 mg/ml). Rolling circle amplification was then performed using the Illustra TempliPhi 100 Amplification Kit (GE Healthcare). Precipitated DNA was resuspended directly in 20 μl of the Illustra TempliPhi sample buffer, and the amplification reaction was allowed to proceed for 24 h at 30°C.
Verification of circular DNA enrichment
In a separate experiment, 5 samples of M. oryzae mycelium were grown up in liquid culture and total DNA was extracted. Circular DNA enrichment was performed as before with some exceptions and without technical replicates. First, linear DNA digestion was only performed for 72 h for 3 samples. Next, aliquots of the incubating samples were taken at 0, 24, 48, and 72 h for these 3 samples, and 0, 48, 72, and 96 h for the last 2 samples. qPCR was then used to verify linear DNA depletion in each sample using an Applied Biosystems QuantStudio 5 instrument and the QuantStudio Design and Analysis desktop software. Primers were used to amplify a portion of the M. oryzae actin gene (MGG_03982) along with Lightcycler 480 Sybr Green I master mix (Additional file 4: Table S3). Data from four qPCR technical replicates was obtained. Remaining linear DNA fraction in each sample at each timepoint was then calculated using the 2−ΔΔCt method.
Illumina library preparation and sequencing
Library preparation was performed by the QB3-Berkeley Functional Genomics Laboratory at UC Berkeley. DNA was fragmented with an S220 Focused-Ultrasonicator (Covaris), and libraries prepared using the KAPA Hyper Prep kit for DNA (Roche KK8504). Truncated universal stub adapters were ligated to DNA fragments, which were then extended via PCR using unique dual indexing primers into full-length Illumina adapters. Library quality was checked on an Agilent Fragment Analyzer. Libraries were then transferred to the QB3-Berkeley Vincent J. Coates Genomics Sequencing Laboratory, also at UC Berkeley. Library molarity was measured via quantitative PCR with the KAPA Library Quantification Kit (Roche KK4824) on a BioRad CFX Connect thermal cycler. Libraries were then pooled by molarity and sequenced on an Illumina NovaSeq 6000 S4 flowcell for 2 × 150 cycles, targeting at least 10Gb per sample. FastQ files were generated and demultiplexed using Illumina bcl2fastq2 version 2.20 and default settings, on a server running CentOS Linux 7. One technical replicate did not pass quality control before library preparation and was omitted.
PacBio library preparation and sequencing
Using a Covaris S220 Focused-Ultrasonicator, 2 μg of each DNA sample was sheared to an approximate fragment size of 5000 bp and purified using AMPure XP beads (Beckman Coulter). Library preparation was performed using the NEBNext Ultra DNA Library Prep Kit (kit number E7370L, New England Biolabs) and 8 cycles of PCR. Barcode sequences and barcodes assigned to each sample are described in Additional files 31 and 32. Libraries were then quality controlled using a Bioanalyzer high-sensitivity DNA chip and the Agilent 2100 Bioanalyzer system. One technical replicate did not pass quality control before library preparation and was omitted. The samples were then submitted to Novogene (Tianjin, China) for PacBio sequencing which was performed on the PacBio Sequel platform using a 600-min sequencing strategy and three SMRT cells.
Inferring eccDNA forming regions from short-read sequencing data
Illumina sequencing signal was analyzed using a custom pipeline inspired by previously published methods [7]. Illumina reads were first trimmed of Illumina TruSeq adapters using CutAdapt [50] version 2.4 with the nextseq-trim=20 option. Trimmed reads were then mapped to the M. oryzae Guy11 genome [37] and the 70-15 mitochondrial sequence [51] obtained from the Broad Institute (https://www.broadinstitute.org/scientific-community/science/projects/fungal-genome-initiative/magnaporthe-comparative-genomics-proj) using BWA-MEM [52] version 0.7.17-r1188 and the q and a options. Read mapping to mitochondrial sequences were excluded. Uniquely mapped reads were then mined for split reads that mapped in the same orientation, had at least 20 bp of alignment on either side of the split, and mapped to only two places in the genome. We also only selected split reads where the start of the read mapped downstream from the end. This last filter sets these split reads apart from split reads that would indicate a deletion in the genome. Split reads for which one side of the split read mapped more than 50kbp away from the other, or to a different scaffold than the other, were excluded. Opposite facing read pairs were also obtained from uniquely mapped reads. Candidate eccDNA forming regions were then inferred by combining these two structural read variants. A split read that contained an opposite facing read pair that mapped no more than a combined 500 bp from the borders of the region contained within the two halves of the split read was considered a candidate eccDNA, and a junction split read. The length distribution of these candidate eccDNA forming regions (Additional file 1: Fig. S35A) was then used to probabilistically infer candidate eccDNA forming regions from multi-mapping reads (Additional file 1: Fig. S35B). For each multi-mapping split read, a list of potential combinations of alignments that satisfied the previously described criteria for split reads was generated, and one of these combinations was chosen at random, weighted by its length according to the generated length distribution. The chosen combinations were then used to infer additional candidate eccDNA forming regions by combining these with opposite facing read pairs as before, except this time obtained from unique and multi-mapping reads.
Each candidate eccDNA forming region was then validated by verifying that the region had over 95% read coverage and at least two junction split reads with the exact same coordinates. Candidate eccDNA forming regions that did not pass these criteria were considered low quality and were not included in the analysis.
Inferring eccDNA forming regions from long-read sequencing data
CCS were first called from PacBio data using ccs version 3.4.1 (https://ccs.how/). Demultiplexing was then performed using lima version 1.9.0 (https://lima.how/) and sequences of barcodes used for library preparation (Additional files 31 and 32). CCSs were then mapped to the M. oryzae Guy11 genome using minimap2 [53] version 2.18-r1015. Only uniquely mapped reads were kept for analysis. EccDNA forming regions were then identified by looking for split reads that either: (1) mapped in the same orientation to the same exact region multiple times or (2) mapped less than 50 kb apart, in the same orientation and oriented properly so that they were indicative of a circular junction rather than a deletion.
Outward PCR validation of eccDNA forming regions and PCR validation of eccDNA-absent genes
Outward facing primers were designed to 8 eccDNA forming regions of interest to validate their presence in our eccDNA sequencing samples. Primers were designed to amplify the junction of each eccDNA but not result in a product of the same size when used on genomic DNA (Additional file 4: Table S3). Primer3 [54] was used for primer design, and the oligonucleotides were synthesized by Integrated DNA Technologies. PCR was performed using New England Biolab’s Phusion High-Fidelity DNA polymerase on M. oryzae Guy11 genomic DNA and rolling circle amplification products for the sample each eccDNA forming region was found in. Five nanograms DNA of each sample was used per 50 μl PCR reaction as well as 5X Phusion HF buffer, 10 mM dNTPs, 10 μM forward primer, 10 μM reverse primer, and 1 unit of Phusion DNA polymerase. PCR conditions were as follows: initial denaturation at 98°C for 30 s, Thirty-five cycles of denaturation at 98°C for 10 s, annealing at 64 or 65°C for 30 s, extension at 72°C for 10 s, and a final extension at 72°C for 5 min. PCR products were run on a 2% agarose gel to check for amplification. Bands of the expected size were extracted from electrophoresis gels using Zymo Research’s Zymoclean Gel DNA Recovery Kit. Sanger sequencing was performed by the UC Berkeley DNA Sequencing Facility, and Sanger sequences were examined for matches to corresponding eccDNA forming regions using BLASTN [55] version 2.2.9 and manual inspection.
PCR validation of eccDNA-absent genes was performed using similar methods. Primers were designed to amplify the entire annotated gene region of MYO1 and the actin gene (MGG_03982) and a small segment of the MAGGY LTR retrotransposon from genomic DNA. Two nanograms DNA of each sample was used per 20 μl PCR reaction as well as 5X Phusion HF buffer, 10 mM dNTPs, 10 μM forward primer, 10 μM reverse primer, and 0.4 units of Phusion DNA polymerase. PCR conditions were as follows: initial denaturation at 98°C for 30 s, 25 cycles of denaturation at 98°C for 10 s, annealing at 64 or 65°C for 30 s, extension at 72°C for 5, 60, or 120 s, and a final extension at 72°C for 5 min. PCR products were run on a 1% agarose gel to check for amplification.
Comparing eccDNA forming regions inferred from Illumina data and eccDNA forming regions inferred from PacBio data
EccDNA forming regions called using Illumina data and PacBio data were found to be identical if their start and end coordinates were within 10 bp of each other to account for mapping errors. EccDNA forming regions were then called with less stringent requirements to verify if any of the missing eccDNA forming regions were being filtered out somewhere in the pipeline. In this test, all uniquely mapped split reads that had 10 or more bp overlap on either side were properly oriented, and those less than 50kb apart were considered eccDNA forming regions.
Benchmarking eccDNA forming regions called using our pipeline on previously published data
EccDNA forming regions called using our pipeline were compared to eccDNA forming regions previously published for H. sapiens [7]. EccDNA forming regions were found to be identical if their start and end coordinates were within 10 bp of each other. EccDNA forming regions described as low quality by the authors were excluded from the published dataset before comparison. High-coverage eccDNA forming regions were chosen for comparison if they had more than 10 associated junction split reads. Finally, multi-mapping reads were excluded from the pipeline to identify eccDNA forming regions called using only uniquely mapped reads.
Comparing eccDNA sequencing samples to each other
Overlaps in eccDNA forming regions between samples were first calculated based off the exact coordinates of the eccDNA forming regions and Venn diagrams based off these overlaps were generated using the ggVennDiagram R package [56] version 1.2.0. EccDNA forming regions found in all technical replicates taken from each biological replicate were first combined before looking for overlaps between biological replicates. Overlaps were then calculated with various levels of tolerance for the start and end coordinates of the eccDNA forming regions so that regions in one sample that were within 10, 100, or 1000 bp from the start and end coordinates of a region in another sample were considered to be found in both samples. Rarefaction analysis for eccDNA forming regions in all samples was performed by sampling mapped eccDNA sequencing reads at random in increasing 10% intervals. For each subsample, eccDNA forming regions were called as previously described and counted. Principal component analysis of read coverage was performed by first calculating junction split read coverage for all 10kbp windows in the genome for each sample. These values were then normalized to the total number of junction split reads in each sample. The matrix of normalized junction split read coverage for all samples was then processed using the prcomp function in R version 3.6.1 with the scale = TRUE option, and the first 6 principal components were plotted using the ggbiplot R package [57] version 0.55.
Gene and effector annotation
The M. oryzae Guy11 genome along with 162 other rice-infecting M. oryzae genomes (Additional file 25) were annotated using the FunGAP [58] version 1.1.0 annotation pipeline. For all genomes, RNAseq data (SRR8842990) obtained from GEO accession GSE129291 was used along with the proteomes of M. oryzae 70-15, P131, and MZ5-1-6 taken from GenBank (accessions GCA_000002495.2, GCA_000292605.1, and GCA_004346965.1, respectively). The “sordariomycetes_odb10” option was used for the busco_dataset option and the “magnaporthe_grisea” option was used for the augustus_species option. For repeat masking, a transposable element library generated by combining the RepBase [59] fngrep version 25.10 with a de novo repeat library, generated by RepeatModeler [60] version 2.0.1 run on the M. oryzae Guy11 genome with the LTRStruct option, was used for all genomes. Genes in M. oryzae Guy11 were assigned names according to the gene names listed on UniProtKB for M. oryzae 70-15 accessed in October 2021. To make this assignment, M. oryzae Guy11 proteins were aligned to the M. oryzae 70-15 proteome using BLASTP [55] version 2.7.1+. Hits with greater than 80% sequence identity that spanned more than 80% of the length of both the M. oryzae Guy11 protein and the M. oryzae 70-15 protein were assigned names.
Effectors were predicted among M. oryzae Guy11 genes by first selecting genes with signal peptides which were predicted using SignalP [61] version 5.0b Darwin x86_64. Genes with predicted transmembrane domains from TMHMM [62] version 2.0c were then excluded. Finally, EffectorP [63] version 2.0 was used to predict effectors from this secreted gene set. Previously well-characterized effectors were identified using previously published protein sequences [45] and DIAMOND [64] version 2.0.9.147.
High-quality LTR retrotransposon annotations in M. oryzae
High-quality, full-length, consensus sequences for known Gypsy elements in M. oryzae (MAGGY, GYMAG1, GYMAG2, PYRET, MGRL3) and one Copia element (Copia1) were generated using the WICKERsoft [65] suite of tools. Reference sequences from other genomes for each element were obtained from the RepBase [59] fngrep version 25.10 library. The M. oryzae Guy11 genome was then scanned for the presence of these sequences using BLASTN [55] version 2.2.9 and then filtered to hits with 90% sequence identity and that contained 90% of the sequence length. Hits for each reference sequence were then extended to include 500 base pairs of genomic sequence upstream and downstream of the hit. A multiple sequence alignment of hits for each reference sequence was then generated using ClustalW [66] version 1.83 and boundaries were visually inspected and trimmed. Consensus sequences for each element were then generated from these multiple sequence alignments. These consensus sequences were split into LTR and internal regions by self-alignment using the BLASTN [55] webserver in August 2020 to identify LTRs. These consensus sequences are available in Additional file 33. Finally, the locations of these elements in M. oryzae Guy11 genome were annotated with RepeatMasker [67] version 4.1.1 with the -cutoff 250, -nolow, -no_is, and -norna options to identify their locations in the M. oryzae Guy11 genome. For read coverage plots as well as histone and GC content plots, full-length LTR retrotransposon copies were required. These were identified by using the original full-length consensus sequences with RepeatMasker as before and then filtering to hits greater than 3000 bp in length and greater than 90% sequence identity.
Comparative analysis of eccDNA forming regions
Analysis of eccDNA forming regions in organisms other than M. oryzae were performed as described above for Illumina sequencing data using previously published genome, gene annotation, and transposable element annotation files (Additional file 34). However, unlike the other data used in this study, the sequencing data in the S. cerevisiae dataset was single-end and therefore opposite facing read pairs could not be used to infer eccDNA forming regions. Instead, only eccDNA forming regions with three overlapping junction split reads were used for analysis. For all organisms, read mapping to unplaced scaffolds and organellar genomes were removed after mapping as described above for the M. oryzae mitochondrial genome. These scaffolds were also removed from genome size, number of coding base pairs, and number of LTR retrotransposon base pair calculations for comparative analysis. To calculate the percent of the genome that was covered in each sample, the genomecov command of the BEDtools [68] suite versions 2.28.0 was used with the -d option along with the coordinates of eccDNA forming regions for each sample. Any base pair with a coverage value greater than zero was counted as being a portion of the genome in an eccDNA forming region.
Characterization of eccDNA formation by LTR retrotransposons
To generate the Manhattan plot, junction split reads were filtered by selecting regions that were made up of 90% LTR retrotransposon sequences. Junction split read coverage was then calculated for each 100-bp window in the genome. Coverage values were then normalized to the total number of LTR eccDNA junction split reads per sample. These coverage values were then averaged across technical replicates for each biological replicate and then averaged across biological replicates. Finally, only 100-bp bins that overlapped at least 50 bp with an LTR retrotransposon were plotted in Fig. 3A. For Additional file 1: Fig. S10, only bins with coverage greater than 0 were plotted.
To simulate expected read coverage for different types of LTR eccDNAs, the Copia1 consensus sequence was taken as a reference, though the MAGGY consensus sequence yielded identical results. Simulated DNA sequences were then generated for each type of LTR eccDNA. The expected 2-LTR circular sequence generated by NHEJ (scenario 1, Fig. 4A) was made up of two LTR sequences and the internal sequence, and the expected 1-LTR circle sequence generated by HR (scenario 3, Fig. 4C) was made up of one LTR sequence and the internal sequence. These sequences were shuffled 1000 times to generate 1000 sequences starting at various points of the expected circularized sequence. For the 1-LTR circle sequence generated by autointegration (scenario 2, Fig. 4B), the random autointegration events were simulated by choosing a random length segment of the internal sequence starting with its start or end, adding the LTR sequence to this sequence, and randomly shuffling the sequence to simulate a circular sequence. This process was repeated 1000 times to generate 1000 sequences. Finally, for each scenario, Illumina reads were simulated to reach 2000× coverage for each of the simulated sequences using ART Illumina [69] version 4.5.8 and the following parameters: 150 bp read length, 450 bp mean insert size, 50 bp insert size standard deviation, HiSeqX TruSeq. Reads were mapped to the simulated sequences using BWA-MEM [52] version 0.7.17-r1188 with default settings and coverage for each base pair was calculated.
To generate observed coverage for each element, sequencing read coverage across the genome was calculated for all 10 base pair windows in the M. oryzae Guy11 genome for each sample. Coverage values were then normalized to the total number of mapped sequencing reads in each sample. These coverage values were then averaged across technical replicates for each biological replicate and then averaged across biological replicates. Finally, profile plot data was generated for full-length, high-confidence sequences for each LTR retrotransposon using computeMatrix scale regions and plotProfile of the DeepTools [70] suite of tools version 3.5.1 using full-length, high-confidence LTR retrotransposon sequences. Profile plots were also generated using previously published whole genome sequencing data by averaging sequencing coverage across all three samples [32, 37, 38].
Identification of split reads associated with eccDNA formation from LTR retrotransposons
Split reads were first identified as any read that mapped to only two places in the genome with at least 20 base pairs of alignment on either side. LTR-LTR split reads were then selected from these split reads for each LTR retrotransposon if both sides of the split read had any overlap with any copy of that retrotransposon’s LTR in the genome. LTR-internal split reads were selected if one side of the split read had any overlap with any copy of the retrotransposon’s LTR in the genome and the other side had any overlap with any copy of the retrotransposon’s internal region in the genome. Read coverage, LTR-LTR split read coverage, and LTR-internal coverage were then calculated for each annotation of each LTR retrotransposon. Coverage values were then normalized to the total number of mapped sequencing reads in each sample. These coverage values were then averaged across technical replicates for each biological replicate and then averaged across biological replicates.
Comparison of microDNAs and large eccDNAs across organisms
Genome, gene annotation, and transposable element annotation files for each organism used for this analysis were as previously described (Additional file 34). Again, organellar genomes as well as unplaced contigs were filtered out of these files before analysis. Introns and UTRs were added to gene annotation files that were missing these elements using the “agat_convert_sp_gff2gtf.pl” and “agat_sp_add_introns.pl” commands from the AGAT toolkit version 0.6.2 (https://github.com/NBISweden/AGAT). Cpgplot of EMBOSS [71] version 6.6.0.0 was used to annotate CpG islands in each genome. Upstream and downstream regions were defined as being 2000 base pairs upstream from the transcription start site and downstream from the transcription end site, respectively. Genic regions were defined as being made up of all sequences between transcription start and end sites, and intergenic regions were the opposite. Junction split reads were counted as being from a specific region if they overlapped to any extent within that region.
The observed percentage of junction split reads overlapping with each region type was calculated for each sample for each organism and an average of these percentages was calculated. The junction split reads of each sample were then shuffled across the genome 10 times, excluding LTR retrotransposon locations, and an expected percentage for each region was calculated, averaged across all permutations, then averaged across all samples for each organism. Finally, the log2 of the fold enrichment was calculated by taking the log2 of the observed average percentage over the expected average percentage.
Correlation of expression and eccDNA formation
Previously published RNAseq data from M. oryzae Guy11 grown in liquid culture in rich medium was obtained [72] (Additional file 35). The data was mapped to the M. oryzae Guy11 genome using STAR [73] version 2.7.1a with the quantMode GeneCounts option. Read counts per gene were then divided by library size and multiplied by the length of each gene in order to obtain reads per kilobase million (RPKMs). RPKMs per gene were then averaged across all samples.
Junction split read counts per gene used to analyze the correlation of expression and eccDNA formation were generated for each gene by counting the number of junction split reads that intersect the gene to any extent. Counts per gene were first assessed for each sample and normalized to the number of junction split reads in that sample. Normalized counts were then averaged across technical replicates for each biological replicate. Average counts per biological replicate were then averaged to obtain the final result.
To compare gene content and eccDNA formation, the M. oryzae genome was divided into 100kbp bins and the number of genes per bin was calculated. Junction split reads per bin were calculated for each sample using the same method. Junction split read per bin values were then normalized to the total number of junction split reads in each sample. These values were averaged across technical replicates for each biological replicate and then averaged across biological replicates.
ACS enrichment analysis
The published ACS sequence profile [42] was used to identify ACSs in eccDNA forming regions using the FIMO [74] software version 4.12.0. Only hits scoring greater than 17 were kept. In order to test for enrichment of these sequences, an expected distribution of ACS sequences was generated by randomly shuffling eccDNA forming regions across the M. oryzae Guy11 genome, excluding regions containing LTR retrotransposons. The observed number of ACS sequences in eccDNA forming regions was then compared to the expected distribution to generate a p-value.
Histone mark and GC content profile plots
Previously published ChIPSeq data for H3K27me3, H3K27ac, H3K36me3, and loading controls were obtained [72]. Sequencing reads for each technical replicate were combined before reads for each treatment for each biological replicate were mapped to the M. oryzae Guy11 genome using BWA-MEM [52] version 0.7.17-r1188 with default settings. The bamCompare command from the DeepTools [70] suite of tools version 3.5.1 with the scaleFactorsMethod readCount option was used to compare the signal from each treatment to the loading control for each biological replicate. computeMatrix scale regions was then used in conjunction with the plotProfile command to generate processed data for profile plots. After verifying that all biological replicates resulted in similar profile plots, only the first biological replicate was chosen for presentation.
To generate tracks used for profile plots, a few different strategies were used. GC content profile plots were generated by calculating GC percentage for 50 base pair windows throughout the genome. Profile plot data was then generated using computeMatrix scale regions and plotProfile commands as before. Methylated and acetylated genes were determined using the methylation and acetylation peaks published by Zhang et al. [72]. Marked genes were called when at least 50% of the gene overlapped with a peak. Large eccDNAs, microDNAs, and LTR eccDNAs from all M. oryzae Guy11 samples were combined into a single list which was filtered for duplicates and used for the corresponding tracks in the profile plots. The genome baseline track was generated by combining all of these eccDNA forming regions and shuffling them randomly across the genome. Finally, the full-length, high-quality LTR retrotransposon annotations described above were used for LTR retrotransposon tracks. The same approach was used for generating profile plots to compare histone marks and GC content for eccDNA-associated and eccDNA-absent genes.
Identification of eccDNA-associated and eccDNA-absent genes
Encompassing split read counts per gene for determining eccDNA-associated and eccDNA-absent genes were generated for each gene by counting the junction split reads that fully encompass the gene using the intersect command of the BEDTools [68] suite version 2.28.0 with the -f 1 option. This count was normalized to the total number of junction split reads in each sample, then averaged across technical replicates for each biological replicate. Genes with a count of zero were removed from each biological replicate before being sorted by this count. Genes in the top third for this count were compared between biological reps using the ggVennDiagram R package [56] version 1.2.0. This count was averaged across biological replicates to obtain the encompassing split read count per gene for visualizations in Fig. 5A and Fig. 8 and for comparison between predicted effectors and other genes (Additional file 1: Fig. S34).
GO enrichment analysis
GO terms were first assigned to annotated M. oryzae Guy11 genes using the PANNZER2 [75] webserver on August 17th, 2020. Annotated GO terms were then filtered to annotations with a positive predictive value greater than 0.6. The topGO [76] R package version 2.36.0 was used to parse assigned GO terms and reduce the gene list to a list of feasible genes for analysis. Either eccDNA-associated or eccDNA-absent were assigned as significant genes, and the number of these genes belonging to each GO term was used as the observed value for the enrichment analysis. A kernel density function was then generated using the gene lengths of the significant gene set. The same number of genes as the significant gene set was sampled at random from the feasible gene set using weighted random selection with weights obtained from the kernel density function. This random sampling was repeated 100 times, and the average of the number of genes belonging to each GO term was used as the expected value for the enrichment analysis. Finally, the chi-square statistic was computed comparing observed and expected values to test for enrichment or depletion of each GO term.
Gene presence-absence variation
In order to identify genes prone to presence-absence variation in the M. oryzae Guy11 genome, OrthoFinder [77] version 2.5.1 with default settings was used on all of the M. oryzae proteomes and the Neurospora crassa proteome obtained from GenBank (accession GCA_000182925.2). Then, for each M. oryzae genome, we queried whether each gene annotated in the M. oryzae Guy11 genome had an ortholog identified by OrthoFinder in that genome. Finally, the absence of genes without orthologues was confirmed using BLASTN [55] version 2.7.1+.
Small, genic deletions were identified using orthologs identified by OrthoFinder [77] version 2.5.1 as before. For each genome, we looked for genes in the M. oryzae Guy11 genome that had no ortholog in that genome, but that were flanked by two genes with orthologs in that genome. One-to-many, many-to-many, and many-to-one orthologs were excluded from this analysis. Candidate gene deletions were validated using alignments performed using the nucmer and mummerplot commands of the MUMmer [78] suite of tools version 4.0.0rc1 to verify that a DNA deletion truly existed and that this deletion overlapped the gene of interest.
Identification of eccDNA-mediated translocations
Identification of translocations with a potential eccDNA intermediate was done by first aligning two genomes using the nucmer command of the MUMmer [78] suite of tools version 4.0.0rc1 with the maxmatch option. The nucmer output was then parsed to look for portions of the reference genome that had an upstream region that aligned to one query scaffold, followed by two separate adjacent alignments to another query scaffold, followed by a downstream region that aligned to the original query scaffold. We also required that the two adjacent alignments in the center of the region were to adjacent regions in the query scaffold, but their order was reversed compared to the reference. Candidate eccDNA-mediated translocations were verified manually by inspecting alignment plots generated using the mummerplot command. The S. cerevisiae EC1118 (GCA_000218975.1) and M22 genomes (GCA_000182075.2) obtained from GenBank were used to verify the ability of our pipeline to detect these translocation events. The M. oryzae Guy11 genome was then compared to 306 M. oryzae genomes (Additional file 27) to look for these events in the M. oryzae species. Before alignment, transposable elements were masked from these M. oryzae genomes using RepeatMasker [67] version 4.1.1 with the -cutoff 250, -nolow, -no_is, and -norna options, as well as a transposable elements library generated by combining the RepBase [59] fngrep version 25.10 with the de novo repeat library generated by RepeatModeler [60] version 2.0.1 run on the M. oryzae Guy11 genome with default settings aside from the LTRStruct argument.
Minichromosome genes and eccDNAs
Scaffolds corresponding to minichromosomes in the M. oryzae FR13 (GCA_900474655.3), CD156 (GCA_900474475.3), and US71 (GCA_900474175.3) genomes were extracted according to previously published data [79]. Exonerate [80] version 2.4.0 was then used with the protein2genome model to identify genes in the M. oryzae Guy11 genome that were found on minichromosomes in these other isolates. Hits with greater than 70% sequence identity to any minichromosome scaffold were identified as genes found on minichromosomes. Encompassing split reads were then counted for all genes. This count was normalized to total number of junction split reads in each sample, then averaged across technical replicates for each biological replicate, and then averaged across biological replicates. Finally, normalized encompassing split read counts for genes found on minichromosomes were compared to genes not found on minichromosomes.
Rarefaction analysis for eccDNA-absent genes and unique eccDNA forming regions
Rarefaction analysis for genes found fully encompassed by eccDNA forming regions were performed by first sampling eccDNA forming regions from all samples at random in increasing 10% intervals. For each subsample, the number of genes found fully encompassed by eccDNA forming regions was determined as before. Next, eccDNA forming regions were shuffled across the genome and sampled at random in increasing 10% intervals. Again, the number of genes found fully encompassed by eccDNA forming regions was determined for each sample. This analysis was performed 100 times with similar results as those represented in Fig. 5C. A similar approach was used for rarefaction analysis of eccDNA forming regions but the number of unique microDNAs, large eccDNAs, and LTR eccDNAs were counted at each subsample instead.
Data processing and analysis
Data processing was performed in a RedHat Enterprise Linux environment with GNU bash version 4.2.46(20)-release. GNU coreutils version 8.22, GNU grep version 2.20, GNU sed version 4.2.2, gzip version 1.5, and GNU awk version 4.0.2 were all used for file processing and handling. Conda version 4.8.2 (https://docs.conda.io/en/latest/) was used to facilitate installation of software and packages. Code parallelization was performed with GNU parallel [81] version 20180322. Previously published data was downloaded using curl version 7.65.3 (https://curl.se/) and sra-tools version 2.10.4 (https://github.com/ncbi/sra-tools). Image file processing was performed with the help of ghostscript version 9.25 (https://ghostscript.com/) and imagemagick version 7.0.4-7 (https://imagemagick.org/index.php). BED format files were processed using bedtools [68] version 2.28.0 and bedGraphToBigWig version 4 (https://www.encodeproject.org/software/bedgraphtobigwig/). SAM and BAM format files were processed with SAMtools [82] version 1.8 and Picard version 2.9.0 (https://broadinstitute.github.io/picard/).
Data processing was also facilitated by custom Python scripts written in Python version 3.7.4 with the help of the pandas [83] version 0.25.1 and numpy [84] version 1.17.2 modules. The scipy [85] version 1.4.1 and more-intertools version 7.2.0 (https://more-itertools.readthedocs.io/) modules were also used.
Data analysis and statistical analyses were performed in R version 3.6.1. Data handling was processed using data.table [86] version 1.13.6, tidyr [87] version 1.1.3, reshape2 [88] version 1.4.4, and dplyr [89] version 1.0.4 packages. Plotting was performed using the ggplot2 [90] version 3.3.5 package, with help from RColorBrewer [91] version 1.1.2, scales [92] version 1.1.1, cowplot [93] version 1.1.1, ggprepel [94] version 0.9.1, and ggpubr [95] version 0.4.0 packages. The Gviz [96] version 1.28.3 was used for BAM file visualization. Tables were made using gt [97] version 0.3.1.