The stem rust fungus Puccinia graminis f. sp. tritici induces centromeric small RNAs during late infection that are associated with genome-wide DNA methylation

Background Silencing of transposable elements (TEs) is essential for maintaining genome stability. Plants use small RNAs (sRNAs) to direct DNA methylation to TEs (RNA-directed DNA methylation; RdDM). Similar mechanisms of epigenetic silencing in the fungal kingdom have remained elusive. Results We use sRNA sequencing and methylation data to gain insight into epigenetics in the dikaryotic fungus Puccinia graminis f. sp. tritici (Pgt), which causes the devastating stem rust disease on wheat. We use Hi-C data to define the Pgt centromeres and show that they are repeat-rich regions (~250 kb) that are highly diverse in sequence between haplotypes and, like in plants, are enriched for young TEs. DNA cytosine methylation is particularly active at centromeres but also associated with genome-wide control of young TE insertions. Strikingly, over 90% of Pgt sRNAs and several RNAi genes are differentially expressed during infection. Pgt induces waves of functionally diversified sRNAs during infection. The early wave sRNAs are predominantly 21 nts with a 5′ uracil derived from genes. In contrast, the late wave sRNAs are mainly 22-nt sRNAs with a 5′ adenine and are strongly induced from centromeric regions. TEs that overlap with late wave sRNAs are more likely to be methylated, both inside and outside the centromeres, and methylated TEs exhibit a silencing effect on nearby genes. Conclusions We conclude that rust fungi use an epigenetic silencing pathway that might have similarity with RdDM in plants. The Pgt RNAi machinery and sRNAs are under tight temporal control throughout infection and might ensure genome stability during sporulation. Supplementary Information The online version contains supplementary material available at 10.1186/s12915-021-01123-z.


Background
Epigenetic regulation controls transcription through formation of transcriptionally inactive chromatin (heterochromatin) and is mediated by interactions between small RNAs (sRNAs), DNA methylation and/or repressive histone modifications. In plants, sRNAs are predominantly in the size range of 20-24 nt and can be divided into two classes: (1) small interfering RNAs (siRNAs) processed by Dicer proteins from long double-stranded RNA (dsRNA) and (2) microRNAs (miRNAs) processed from stem-loop regions of single-stranded primary RNAs [1]. Both miRNAs and siRNAs are bound to argonaute (AGO) proteins to induce silencing of targets by base-pairing interactions and complementarity [2].
Heterochromatin plays both regulatory and structural roles. Heterochromatin not only regulates gene transcription, but also ensures proper chromosome segregation during cell division at centromeres and genome stability through regulation of transposable elements (TEs) [3]. Epigenetic silencing in repetitive genome regions is a key mechanism to prevent the proliferation of TEs. In fungi and plants, DNA cytosine methylation (5-methylcytosine; 5mC) is found mainly in transposable elements and other repeated DNA [4,5]. In plants, RNA-directed DNA methylation (RdDM) is the major sRNA-mediated epigenetic pathway and functions in maintaining genome stability through transposon control, pathogen defence and stress responses, intercellular communication and germ cell specification [6]. RdDM uses sRNAs to trigger DNA cytosine methylation at homologous DNA sequences in the genome [7]. These nuclear-localized heterochromatic sRNAs are the most abundant sRNAs in plants, predominantly 24 nucleotides (nts) in length, derived from intergenic or repetitive regions and associated with the argonaute 4 (AGO4) clade to regulate epigenetic silencing. Adenine is the most common 5′ base of AGO4-bound 24-nt sRNAs in Arabidopsis [8].
Unlike the extensively studied RdDM pathway in plants [9], the mechanisms of epigenetic silencing in the diverse fungal kingdom have remained elusive [10]. The RNAi machinery of the fission yeast Schizosaccharomyces pombe and the filamentous fungus Neurospora crassa are thus far the best-studied non-pathogenic model species [11]. In fission yeast, RNAi components participate in heterochromatin formation through histone H3K9 modifications at centromeres, the mating type interval and the subtelomeric regions [12,13]. DNA cytosine methylation is absent in the model yeasts S. pombe and S. cerevisiae [14]. In Neurospora crassa, RNAi components are involved in quelling and meiotic silencing by unpaired DNA (MSUD), but not in heterochromatin formation. Quelling is an RNAi-related genesilencing mechanism in Neurospora that is induced by repetitive transgenic sequences and occurs in the vegetative growth stage to control transposons [15]. Outside the model fungal species, very little is known about the interplay between sRNAs and epigenetic silencing, particularly in highly repetitive fungal pathogen genomes that need to inactivate TEs. Unlike plants, fungi lack canonical gene-body methylation, but in line with plants, 5mC is abundant in repetitive DNA and transposons across fungal species [4]. RNAi has been suggested as a key determinant of longer centromeres in the human fungal pathogen Cryptococcus and as a suppressor of centromeric retrotransposons to ensure genome stability [16]. In the human pathogen Mucor circinelloides, retrotransposons surrounding the centromeres are silenced by a canonical RNAi pathway involving Dcl2 and Ago1 [17] and a non-canonical RNAi pathway represses the canonical pathway, controlling virulence processes and transposon movements [18]. How RNAi contributes to epigenetic silencing in plant-pathogenic fungi has thus far largely remained unexplored. In the ascomycete Magnaporthe oryzae, a plant pathogen, 18-23-nt sRNAs are produced from repetitive elements and are implicated in TE regulation in vegetative tissue, whereas 28-35-nt sRNAs mapping to transfer RNA (tRNA) loci are enriched in the appressoria [19]. However, a correlation between sRNAs and epigenetic silencing has not been shown in M. oryzae. In the white-rot basidiomycete fungus Pleurotus ostreatus, TE silencing is associated with 21-nt sRNAs and DNA methylation [20].
The basidiomycete fungus Puccinia graminis f. sp. tritici (Pgt) is a plant pathogen that causes wheat stem rust disease, resulting in devastating crop losses [21]. Pgt is a dikaryotic fungus that contains two distinct haploid nuclei. During the asexual infection phase on a cereal host, Pgt produces single-celled dikaryotic urediniospores that germinate on the leaf surface [22,23]. Subsequently, appressoria form and penetration occurs through stomata with subsequent development of specialized infection structures called haustoria by around 2 days. Haustoria enable nutrient uptake as well as the delivery of secreted pathogen proteins called effectors into the host plant cell [24]. The start of urediniospore production occurs at approximately 6-7 days post-infection (dpi) and urediniospore pustules typically erupt through the leaf or stem surface (sporulation) after 8-10 dpi [22]. In the poplar rusts, intense cell division activity has been observed in the sporulation area [25].
Chromosome-scale genome assemblies offer the opportunity to investigate the structural organization of the genome including localization of centromeres, transposable elements (TEs), DNA methylation, sRNAs and how this links to their function. Recently, the chromosome-scale assembly of Pgt 21-0 has become available [26]. This assembly is fully phased with 18 chromosome pseudomolecules for each of the two haplotypes derived from the two haploid nuclei. Whilst substantial time-course transcriptomic resources have been generated for Pgt [27][28][29], how it utilizes RNAi and epigenetic silencing during the infection cycle has thus far been unknown. Here, we bring together Hi-C data, methylation data and sRNA/transcriptome sequencing data to uncover fundamental characteristics of the stem rust RNAi machinery, DNA methylation and the firsttime characterization of rust centromeres.

Results
Pgt centromeres are highly repetitive sequences with little sequence conservation between haplotypes We used chromatin conformation capture assay data (Hi-C) from Pgt 21-0 [26] to pinpoint the location of the Pgt centromeres, the first-time characterization of rust fungal centromeres. Fungal centromeres give rise to a distinct outwards-spreading shape in a Hi-C contact map [30], as seen in the contact maps for individual chromosomes (Additional files 1 and 2: Fig. S1 and Fig.  S2). Because the centromeres of each chromosome tend to cluster in the three-dimensional space of the nucleus, this region also shows a physical association between chromosomes visible as distinct cross-shapes in the whole haplotype Hi-C contact map (Fig. 1A). We selected the midpoint of the outwards-spreading shape in the Hi-C contact maps of each chromosome as the putative centre of each centromere. For example, Pgt chromosome 1A has a suggested centromere centre around position 2.36 MB and the surrounding region shares strong Hi-C links with the putative centromeres on other chromosomes ( Fig. 1A and Additional file 3: Fig. S3). To add further support to the centromeric regions and their lengths, we plotted the density of expressed genes, RNA-seq transcription levels at late infection and in germinated spores as well as the coverage of repetitive elements on the chromosomes. This shows that the regions around the putative centromere centres indicated by the Hi-C data are transcriptionally silent, gene-poor and repeat-rich regions (Figs. 1B and 2, Additional file 4: Fig. S4). We selected putative centromere boundaries by inspecting the lengths of these transcriptionally silent, gene-poor regions for each chromosome Fig. 1 Hi-C contact map shows the location of the Pgt centromeres. A A Hi-C contact map of the 18 chromosomes in haplotype A shows the position of the centromeres as cross-like shapes, highlighted with a red rectangle. B The positions of the centromeres in haplotype A as indicated by the Hi-C contact map are in transcriptionally silent genomic regions. Reads per million (RPM) for the late infection (7 dpi) and germinated spores RNA-seq samples are shown in red and green, respectively (10-kb windows, RPM from 0-100 are shown for clarity) (Fig. 2). Centromeres appear to span between 100 and 340 kb (253 kb on average), with only a slight decrease in GC content for most chromosomes compared to the rest of the chromosome (Table 1).
We then aligned the two haplotype chromosomes. Interestingly, some chromosomes share regions of macro-synteny (conserved regions > 20 kb) in their centromeres whereas others do not. For example, chromosomes 1A and 1B show no to very low sequence identity in the centromeric region, as opposed to the remainder of the chromosomes (Fig. 3A). In contrast, chromosomes 2A and 2B share macrosynteny in the centromeric region (Fig. 3B). We then used pairwise k-mer distance estimation to compare centromeric regions and non-centromeric regions for all chromosomes. Clustering analysis showed a large distance between the centromeric regions and noncentromeric regions (Fig. 3C). For the noncentromeric regions, the two homologous chromosomes all grouped into closely related pairs with similar distances for all 18 chromosome pairs. In contrast, whilst most (14/18) centromeric regions grouped by chromosome pairs, the difference between them varied, with some very closely related (e.g. on chromosomes 2A and 2B) and others quite divergent (e.g. chromosomes 1A and 1B). Others showed unexpected groupings. For example, the centromeres of chromosomes 18A and 15B are more closely related to each other than to the centromeres of the corresponding chromosomes 18B and 15A (Fig. 3C). Overall, the centromeric regions of Pgt are highly variable and unexpectedly, most of them are also highly divergent between haplotypes.
Young transposable elements accumulate in the highly repetitive Pgt centromeric regions We then assessed the repetitive element coverage of the Pgt chromosomes and their centromeres. All 36 (2*18) Pgt centromere regions have higher coverage of repetitive elements than the non-centromeric regions (Fig. 4A). Repetitive elements cover 75-96% of the bases in the Pgt centromeres compared to only 52-62% of the noncentromeric regions on the chromosomes. The repeat types with the highest sequence coverage in the centromeric regions vary considerably between the chromosomes. Most centromeres are enriched for LTR Gypsy retrotransposons (17-56%) compared to non-centromeric regions (11-17%), although this is also the most abundant repeat family outside the centromeres (Fig. 4B). However, several centromeres show a high coverage of repeat families that are of low abundance outside the centromeric regions. For example, DNA transposons of the superfamily CACTA are highly abundant in the centromere of chromosome 17A (33% coverage), whilst the centromere of chromosome 11B is enriched for LTR Copia retrotransposons (35% coverage). Again, these patterns are not always shared between centromeres within a chromosome pair.
To determine whether the age of TEs affects their distribution, we used the nucleotide identity of each TE to the consensus sequence of the family (provided by the Fig. 2 Pgt centromeric regions for two selected chromosomes. Karyoplots of the centromeric regions of Pgt chromosomes 1A and 2A. The density of expressed genes and the coverage of repetitive elements are shown as well as the GC content (1-kb windows). Reads per million (RPM) for the late infection (7 dpi) and germinated spores RNA-seq samples are shown as red and green lines, respectively (10-kb windows). Centromeric regions are highlighted with yellow boxes REPET pipeline [31]) as a proxy for the relative age of TE insertion. Most TEs in the Pgt genome have >70% identity to the consensus; however, the centromeres are enriched for young TEs (defined as having > 90% identity, Fig. 4C). In the centromeres, 28.3% of repeats with family identity information are young TEs compared to 18.8% outside the centromeres. Taken together, the centromeres are highly repetitive regions in the Pgt genome that are enriched for young TEs, similarly to Arabidopsis where the majority of young repeats are found in pericentromeric domains [32].
Pgt centromeres are heavily 5mC-methylated at genomic CG sites We used Nanopore sequencing to detect DNA methylation in genomic DNA of Pgt during two distinct infection stages: (1) germination of spores and (2) late infection stage of wheat when sporulation starts (7 dpi). The germinated spore and late infection methylation data have~48x and 35x genome coverage, respectively. The Nanopore signal distinguishes 5-methylcytosine (5mC) from unmethylated cytosine and N6-methyladenine (6mA) from unmethylated adenine [33]. Methylated sites were defined as nucleotide positions where more than 50% of sequence reads showed the presence of a modified residue. We found that the occurrence of methylated cytosine residues was substantially higher in CG dinucleotide and CCG trinucleotides than in other di-and trinucleotide contexts, similar to cytosine methylation patterns in plants. The proportion of both CG and CCG methylation sites was significantly higher in centromeric regions (45.5% and 26.6%) than in non-centromeric regions (21.3% and 12.9%). Levels of 6mA methylation were low both inside and outside the centromeres with no substantial difference between dinucleotide contexts (Fig. 5A). The frequency of methylation at methylated CG dinucleotide sites (i.e. % of reads from a site that show base modifications) is also higher for sites that occur in the centromeres than for those outside the centromeres, but very similar between germinated spores and late infection (Fig. 5B). Taken together, Pgt has a strong preference for genomic CG (CpG) methylation and centromeres are heavily CG-methylated genomic regions.
CG methylation is associated with silencing of young TE insertions both inside and outside of centromeres CG methylation is strongly associated with repetitive regions in both germinated spores and late infection, with 89% and 88.8% of methylation sites overlapping with TEs, respectively. 26.6% of all Pgt TEs are methylated (i.e. they overlap with at least two methylation sites) in germinated spores and 24.9% in late infection. Ninetythree per cent of methylated TEs show methylation in both conditions. A higher percentage of TEs in centromeric regions are methylated (44.8% and 47.9% in infected leaves and germinated spores, respectively) than TEs in non-centromeric regions (23.9% and 25.5% in infected leaves and germinated spores, respectively). We did not observe significant differences in TE family distributions for TEs that are methylated only in germinated spores or only in infected leaves (data not shown). CG methylation is strongly associated with young TEs (> 90% identity), not only inside the centromeric regions but even more so outside the centromeres. 56.1% and 51.9% of young TEs in the centromeres are methylated  Table 2). Whilst the centromeres are highly methylated genomic regions and preferentially harbour young TE elements (Fig. 4C), young TEs that occur outside the centromeres are also heavily methylated. 48.6% and 46.2% of young TEs outside the centromeres are methylated in germinated spores and in infected leaves, respectively ( Table 2). This suggests that Pgt employs a mechanism that maintains silencing of young TEs not only in the centromeres but also through targeting of their homologous sequences outside the centromeres. We hypothesized that Pgt might employ sRNAs to direct DNA methylation to young TEs outside the centromeres.

Pgt induces early and late waves of sRNAs with opposing profiles
To assess the role of the RNAi machinery in methylation, we performed sRNA sequencing on germinated spores, uninfected wheat and infected wheat at 3 dpi, 5 dpi and 7 dpi. Adapter-trimmed and tRNA/rRNA-filtered sRNA reads were first mapped to the wheat and Pgt genomes. Strikingly, the read alignment rates show a strong presence of Pgt sRNAs in the late infection sample (7 dpi, Table 3). The mapping rates to rust in early infection (3 dpi and 5 dpi) are low at 5.25% and 5.37%, respectively, but increase drastically to 50.2% in late infection (7 dpi). In contrast,~67% of sRNA reads map to We used the ShortStack software to predict miRNA and siRNA loci [34]. ShortStack uses several criteria to separate sRNA loci from degradation artefacts, such as read length distribution in a putative sRNA cluster as well as strandedness of the locus and the predicted precursor secondary structure in case of miRNAs. We predicted 7312 Pgt sRNA loci (7292 siRNA loci and 20 miRNA loci) and 413 wheat sRNA loci (361 siRNA loci and 52 miRNA loci) (Additional files 5, 6, 7, and 8: Data S5-S8). For each predicted sRNA locus, we obtained the most abundant sRNA. For predicted miRNA loci, this will generally be the functional mature miRNA. The read length distributions of rust and wheat sRNAs show different patterns and distinct peaks of abundance (Fig. 6). The Pgt-derived sRNAs are predominantly 20, 21 or 22 nts in length. This is true both for the single most abundant sRNA in each locus as well as for the total sRNA reads derived from each locus (Fig. 6). There are two distinct peaks at 21 nt and 24 nt for the wheat sRNAs, as is expected for plant sRNAs. Most predicted wheat miRNAs are 21 nt and have a 5′ uracil (67.6%) whilst the wheat siRNAs are mostly either 21 nt with a 5′ uracil or 24 nt with a 5′ adenine ( Table 4). The two distinct peaks at 21 and 24 nts with their corresponding 5′ nucleotide preferences support the predicted presence of both miRNAs and siRNAs in the wheat sRNA set and the 24 nt wheat siRNAs are likely involved in RdDM [8,Fig. 5 Methylation site preferences in Pgt. A Proportions of Pgt dinucleotides and trinucleotides that are methylated in the centromeres and outside the centromeres. CG dinucleotides are highly enriched for 5mC methylation. For the trinucleotides, CCG is enriched for 5mC methylation. We observed very low levels of 6mA methylation. The centromeres are heavily methylated compared to the non-centromeric regions. Slightly higher levels of 5mC methylation are seen in infected leaves compared to germinated spores in centromeres. B Box plots showing methylation frequency distribution of CG methylation sites. Centromeres show higher methylation frequencies than non-centromeric regions Table 2 Proportions of young and old Pgt TEs that are methylated in the centromeres and outside the centromeres. Both inside and outside the centromeres, young TEs (> 90% family identity) are preferentially targeted by CG methylation (chi-square test with significance p < 0.00001: *** )  35]. However, two distinct classes of siRNAs also appear to be present in Pgt based on 5′ nucleotide preference, although differing in size to the wheat siRNAs. Pgt siR-NAs of length 20-21 nts have a strong preference for a 5′ uracil (~75%), whereas 53% of the 22-nt Pgt siRNAs have a 5′ adenine, suggesting functional diversification. Next, we assessed the differential expression of Pgt sRNAs at the start of infection (germinated spores), during early infection (3 dpi and 5 dpi) and during late infection when sporulation begins (7 dpi) (Additional file 9: Table S9). We detected no differential expression of Pgt sRNAs between 3 dpi and 5 dpi, likely due to the low number of mapped reads (Table 3, Fig. 7A), and therefore combined these time points to represent early infection. Strikingly, 91.3% of the Pgt sRNA clusters are predicted as differentially expressed amongst germinated spores, early infection (3 and 5 dpi) and late infection (7 dpi): 2663 are up-regulated in germinated spores, 530 up-regulated during early infection and 4005 upregulated during late infection (Fig. 7, Additional files 10, 11, 12, and 13: Data S10-S13). A large proportion of the up-regulated sRNAs at the late infection time point (76.1%; 3046 of 4005) showed up-regulation against all the other conditions (germinated spores, 3 dpi and 5 dpi). In contrast, amongst the up-regulated sRNAs in germinated spores, the majority (86.9%; 2315 of 2663) are up-regulated against the late infection sample, with only a small number (33 and 29) being up-regulated compared to the 3 dpi or 5 dpi samples. Thus, the sRNAs up-regulated during late infection are highly specific to that time point, indicating the presence of early (germinated spores, 3 dpi and 5 dpi) and late (7 dpi) waves of Pgt sRNAs during wheat infection. In contrast to Pgt, which exhibits prominent early and late infection waves of sRNAs, only 14 of the 413 wheat sRNAs (3.4%) are predicted to be differentially expressed. Amongst these 14 differentially expressed wheat sRNAs, there is no predicted miRNA.
We assessed the length distributions and 5′ nucleotide preferences of differentially expressed Pgt sRNAs (Fig.  7C, D). The early wave Pgt sRNAs are predominantly 21 nts in length (44% and 46.2%, respectively). In contrast, the largest class (40.8%) of the late wave Pgt sRNAs are 22 nts in length. Pgt sRNAs with no detected differential  Fig. 7C). The majority (60-80%) of the 20-, 21-and 22-nt sRNAs up-regulated in germinated spores, during early infection and those with no differential expression contain a 5′ uracil (Fig. 7D). This is also true for 20 and 21 nt late wave sRNAs. In contrast, the 22-nt late wave sRNAs have a strong preference for 5′ adenines (~70%, Fig. 7D). This suggests the specific induction of a different functional class of sRNAs during these late infection stages, similar to the occurrence of 24-nt siRNAs with a 5′ adenine in plants.
To validate the sRNA deep sequencing data, we conducted an sRNA northern blot to detect two early-wave, miRNA-like candidates, miRL-1 (22 nts with 5′ adenine) and miRL-2 (22 nts with 5′ uracil) (Fig. 8A). Both have a predicted miRNA-like hairpin structure and are predicted to be up-regulated in germinated spores compared to late infection. As shown in Fig. 8B, both miRL-1 and miRL-2 of the 22-nt size showed high abundance in germinated spores but low expression in infected wheat tissues at the late stage of infection. This expression pattern is consistent with the sRNA-seq data showing high levels in germinated spores but low level at the late stage of infection (Fig.  8C). Interestingly, apart from the 22-nt band, there were additional sRNA bands either larger or smaller than 22 nt (Fig. 8B). For miRL-2, all sizes are downregulated in infected tissues, but for miRL-1, the larger, 23-nt band showed equal levels of accumulation between germinated spores and infected tissues, whereas all the smaller-sized sRNA species were downregulated. The 23-nt miRL-1 was less abundant in the sRNA-seq data than the 22-nt miRL-1. The 23-nt miRL-1 has 5 times more sRNA-seq reads per million in the germinated spores samples than in the late infection sample (22-nt miRL-1: 18 times more). The identity and functional significance of the various sRNA size classes require further investigation.
Late wave Pgt sRNAs are produced from the centromeric regions, whereas the early wave sRNAs are highly conserved and derived from genes The late wave Pgt sRNAs also exhibit opposing genomic origins to the early wave sRNAs ( Table 5). The early wave sRNAs predominantly map to annotated genes (77.1% in germinated spores; 68.3% at 3 and 5 dpi), compared to only 16% of late wave sRNAs. Late wave sRNAs are largely generated from repetitive elements (88.3%), in contrast to the early wave sRNAs (24.9% in germinated spores and 30.9% at 3 and 5 dpi). Most of the repetitive elements associated with sRNAs belong to the class of LTR retrotransposons, particularly Gypsy elements. Strikingly, 24% of the late wave sRNAs originate from the centromeric regions, in contrast to only 1-3% of the early wave sRNAs and the sRNAs with no differential expression (Table 5).
A gene function ontology (GO) term analysis of the 1878 genes that are associated with Pgt sRNAs upregulated in germinated spores reveals an enrichment in proteins with ATP binding activity as well as proteins with helicase and motor activity and RNA binding ( Table 6). No significant enrichment in functional annotation was observed for genes that are associated with sRNAs with no differential expression, or with sRNAs up-regulated during early or late infection.
We further investigated the locations of the Pgt sRNAs on the 18 chromosome pairs and found that similar proportions occur in each of the two haplotypes (Table 5). We then assessed if sRNAs have a homologous counterpart on the corresponding haplotype. For this, we remapped the sequencing reads that define an sRNA locus to the remainder of the genome and assigned the sRNA locus that has the highest coverage by those mapped reads as the homologous counterpart. Over two-thirds of sRNAs up-regulated in germinated spores have a homologous counterpart (68.1%, Table 5). Most of these homologous pairs (82.6%) are located on the corresponding chromosome from the alternative haplotype and generally occur in syntenous positions (Additional file 14: Fig. S14). This is consistent with the observation that most of these sRNAs map to gene sequences which are expected to occur in allelic positions in each haplotype. In contrast, only around half of the late wave sRNAs have a homologous counterpart (54.8%), and only 18.2% of these  (Table 5). In summary, the early wave sRNAs are conserved across the haplotypes and originate from gene models, whereas the late wave sRNAs originate from repetitive elements that are not conserved between haplotypes.
During late infection, Pgt sRNAs are highly expressed from the centromeres and appear to direct genome-wide methylation to young TEs To assess sRNA expression in the centromeres, we plotted sRNA levels at late infection and in germinated spores on the chromosomes. During late infection, strong peaks of sRNA expression are apparent in the centromeric regions, except for chromosomes 11A and 11B (Fig. 9). Interestingly, in this Pgt isolate (21-0), chromosome 11B resides in the haplotype A nucleus and has been involved in a single chromosome exchange event between nuclei [26]. The late wave Pgt sRNAs are heavily induced from the centromeres in late infection. Whilst there is also transcription of sRNAs from the centromeres in germinated spores, the centromeric peak is less apparent and the overall sRNA levels in the centromeres are only about a third of that observed during late infection (average reads per million, 160 at late infection and 54 in germinated spores).
To investigate the genomic regions that might be targeted by these centromeric Pgt sRNAs, we re-mapped sRNAs without mismatches to the chromosomes and recorded all their alignment positions. We then assessed which types of genomic regions are enriched for sRNA targeting. 32.3% of young TEs in the centromeres and 20.1% of the young TEs outside the centromeres have a late wave sRNA mapping to them, with much lower mapping rates to young TEs observed for the other sRNA classes (Table 7). In contrast, a much lower proportion (7.1% and 1.9%) of old TEs is associated with late wave sRNAs. This indicates that the late wave sRNAs might be involved in the silencing of young TEs, both inside and outside the centromeres.
To address this further, we explored whether a TE that has an sRNA mapping to it (TE sRNA+ ) is more likely to be methylated than a TE that does not have an sRNA mapping to it (TE sRNA− ). Indeed, we observed that the Pgt sRNAs are strongly associated with methylation of TEs, particularly young TEs, both inside and outside the centromeres (Table 8). Inside the centromeres, 79% of young TEs sRNA+ are methylated compared to 49.3% of TEs sRNA− . Strikingly, sRNA-associated methylation is also pronounced outside the centromeres where 80% of young TEs sRNA+ are methylated compared to 43.6% of TEs sRNA− . Taken together, our results indicate that the late wave Pgt sRNAs originate mainly from the Fig. 8 Northern blot analysis of two Pgt miRNA-like candidates. A Two miRNA-like candidates, miRL-1 (22 nts with 5′ adenine) and miRL-2 (22 nts with 5′ uracil) are shown with sRNA read coverage and predicted miRNA-like hairpin structures. B Northern blot analysis confirmed their downregulation in infected tissue comparing to germinated spores. The same blot was stripped and re-hybridized with Pgt U6 or wheat miR168 probes for use as the RNA loading control. The wheat miR168 band also indicates the position of 21-nt sRNA. C The sRNA-seq data shows that both miRNA-like candidates are highly abundant in germinated spores compared to late infection centromeres but might direct DNA methylation to loci homologous to their sequences both inside and outside the centromeres, preferentially targeting young TEs.
We then further investigated the correlation between methylated sites and sRNA loci for TEs. TEs that have a sRNA mapping to them have a higher proportion of CGs that are methylated (Table 9) and this is most pronounced in the late wave sRNAs and in the sRNAs that are 22 nts in length with a 5′ adenine. Methylation frequencies in TEs sRNA+ are higher than those in TEs sRNA− (Fig. 10). However, methylation levels appear to be relatively stable in both germinated spores and late infection. This indicates that RNA-directed methylation in Pgt might reinforce stable methylation at targeted transposon loci that are already DNA-methylated, similarly to RdDM in plants [36], rather than inducing transient methylation specifically late in infection.

TE-associated CG methylation leads to silencing of nearby genes
We investigated the effect of CG methylation and sRNA-directed CG methylation on overlapping or adjacent genes. Ten thousand four hundred seventy-eight Pgt genes overlap with CG methylation sites in germinated spores and 9894 genes overlap with CG methylation sites in late infection. The proteins encoded by these methylated genes are not enriched for secreted proteins or for any GO terms (data not shown). However, almost all these methylated genes overlap with repeats (except for 944 genes in germinated spores and 910 in late infection), suggesting that CG methylation predominantly correlates with TE silencing in Pgt.
To determine if the presence of methylated TEs affects nearby gene expression, we assessed transcription levels of genes that are close to or overlap with TEs. For both germinated spores and infected leaves, genes that have Fig. 9 Abundance of sRNAs on the Pgt chromosomes. The sRNA levels (reads per million per 10-kb genomic windows, < 1000 RPMs shown for clarity) are shown for late infection (red; above each chromosome) and for germinated spores (green; below each chromosome). Centromeric regions are indicated by yellow boxes. Higher levels of sRNAs are seen from the centromeres during late infection than in germinated spores an overlapping methylated TE have significantly lower gene transcription levels than those overlapping with a non-methylated TE (Fig. 11). This silencing effect is also observed for genes that are close (< 500 bps) to a methylated TE compared to genes that are close to a nonmethylated TE (Fig. 11A), whilst no difference was observed for genes > 500 bps from methylated TEs. We then compared the expression levels of genes that overlap with TEs considering whether they also overlap with the late wave sRNAs or not. Whilst genes that overlap with methylated TEs have low expression levels in general, they have significantly lower levels of expression when the TE is additionally targeted by a late wave sRNA and this holds true for both young and old TEs (Fig. 11B).

A subset of RNAi genes are up-regulated during late infection, supporting functional diversification
RNAi machinery genes were previously identified in the reference genome Pgt p7a [29,37]. We searched for the Pgt p7a RNAi genes in the gene annotation of the fully phased, chromosome-scale assembly of Pgt 21-0. Two argonaute genes, three dicer genes and five RdRP genes are present in the annotation of Pgt 21-0 on each haplotype (Table 10). We then searched for 5mC methyltransferase (5mC MTase) genes. Four classes of fungal DNA methyltransferases have been observed in fungi, but basidiomycetes predominantly have the DNMT1 and DNMT5 genes [38]. We identified DNMT1 and DNMT5 in the Pgt 21-0 annotation by searching for the previously identified Pgt p7a genes [4] and additionally for the DNA methylase domain PFAM domain (PF00145). In line with Bewick et al. [4], we found homologs of the DNMT1 and DNMT5 genes and confirmed the absence of the other two classes (DIM-2 and RID) and of 6mA DNA and RNA MTase genes in Pgt 21-0. The lack of 6mA DNA and RNA MTase genes in Pgt indicates that cytosine methylation is the primary DNA methylation process active in this species. The gene expression profiles of the RNAi and 5mC MTase genes during a time course of Pgt 21-0 infecting wheat from 2 to 7 days post-infection (dpi) [28] and in germinated spores and haustorial tissue [27] indicate  two main patterns (Fig. 12A): one set of RNAi genes (RdRPs 2/4/5, AGO2 and dicers 1/2) and 5mC Mtase genes that are constitutively expressed during infection, with the AGO2 genes showing particularly high expression, and another set of RNAi genes (AGO 1, dicer 3 and RdRPs 1/3) that are highly expressed only during the later stages of infection, with no or very low expression in germinated spores and during early infection. We did not observe differences in expression patterns of the RNAi genes between the two Pgt haplotypes. A protein domain analysis further supports the functional diversification of the Pgt argonautes AGO 1 and 2. AGO 1 has an argonaute linker 1 domain and is longer in sequence, whereas AGO 2 has an argonaute linker 2 domain (Fig. 12B). A phylogenetic tree constructed from the argonaute proteins of Arabidopsis thaliana and several rust fungi further supports the diversification of the rust AGOs into two classes. In Arabidopsis thaliana, AGO1 and AGO10 bind preferentially small RNAs with a 5′ uracil, whereas AGO2, AGO4, AGO6, AGO7 and AGO9 prefer sRNAs with 5′ adenines and AGO5 5′ cytosines [1] and their diversification into these three classes is apparent from a phylogenetic tree (Fig. 12C). The rust argonautes are distinct from the Arabidopsis clade and divide into two distinct families, with one copy of each present in the haploid genomes of each rust species. Taken together, the expression and sequence analyses show that Pgt RNAi machinery has functionally diversified and suggests that Pgt might use RNAi to regulate stage-specific infection processes, such as during the formation of new urediniospores during late infection.

Discussion
Epigenetic silencing mechanisms mediated by sRNAs and methylation are not well-studied in plantpathogenic fungi [10] and had thus far not been described in rust fungi. Current knowledge has been derived mainly from model species which comprise a relatively small group of fungi, or from studies in human fungal pathogens [16][17][18]. Through sRNA sequencing over a time course of Pgt-wheat infection, we uncovered that Pgt produces two distinct waves of sRNAs with example, a study in the stripe rust fungus Puccinia striiformis f.sp. tritici sequenced sRNAs at 4 dpi and found that the majority of the predicted 20-22 nt Pst sRNAs carry a 5′ uracil [39]. The presence of distinct sRNA profiles in mycelia and appressoria tissues was suggested Fig. 11 Expression levels of genes that have an inserted methylated repeat or that overlap with methylated repeats. A Expression levels (lognormalized transcripts per million) of genes are shown. A strong silencing effect is shown for genes that contain a methylated repeat, both in germinated spores and during late infection. Genes that are near a methylated repeat (up to 500 bps) also show suppressed expression. B This gene silencing is more pronounced if a TE has a late wave sRNA mapping to it (TEs sRNA+ ), both for young and old TEs Table 10 RNAi and 5mC methyltransferase genes in Pgt. For each protein, the identifiers of the allelic proteins on each haplotype are given. Homologs of the Pgt p7a PGTG_13081 and PGTG_13088 dicer proteins were not found in the gene annotation of Pgt 21-0 in the rice blast fungal pathogen, Magnaporthe oryzae [19]. However, prominent waves of sRNA expression profiles during infection of plants had thus far not been reported.
Pgt sRNA expression is under tight temporal control, with~90% of Pgt sRNAs differentially expressed over the time course. The presence of two distinct sRNA profiles has thus far not been observed in rust fungi and supports functional diversification of the RNAi machinery, with a strong role in the infection and proliferation process. The early wave sRNAs are predominantly 21 nts with a 5′ uracil derived from genes. In contrast, the late wave sRNAs are mainly 22-nt sRNAs with a 5′ adenine derived from repetitive sequences. We speculate that the majority of 22-nt Pgt sRNAs are responsible for transcriptional silencing of TEs during sporulation and the majority of 20-21-nt Pgt sRNAs mediate posttranscriptional silencing of genes. This is similar to what has been reported in plants, which produces 20-22-nt miR-NAs/siRNAs and 24-nt heterochromatic sRNAs [40]. In plants, TEs are silenced mainly via 24-nt sRNAs in the RdDM pathway [1]. These 24-nt sRNAs are most abundant during seed development in plants, presumably to ensure stable inheritance of the genome.
The up-regulation of 22-nt Pgt sRNAs with enrichment for 5′ adenines during late infection coincides with the up-regulation of the AGO1 gene. Similarly, the preferential accumulation of 21-nt 5′ uracil sRNAs in germinated spores and during early infection correlates with high-level expression of AGO2 and relatively low expression of AGO1. This suggests that similarly to plants, the 5′ nucleotide of Pgt sRNAs might have a strong effect on preferential loading into different argonautes. In Arabidopsis thaliana, AGO1 and AGO10 bind preferentially small RNAs with a 5′ uracil, whereas AGO2, AGO4, AGO6, AGO7 and AGO9 prefer sRNAs with 5′ adenines and AGO5 5′ cytosines [1]. Our analysis suggests that Pgt AGO2 preferentially loads sRNAs with a 5′ uracil and AGO1 preferentially binds 22-nt sRNAs with a 5′ adenine, which is worthy of investigation in future experimental studies.
We discovered parallels between Pgt sRNAs and plant sRNAs, in particular evidence for an sRNA-directed TE silencing pathway in Pgt that might resemble the RdDM pathway in plants. Such a RdDM-like pathway has thus far not been reported in fungi and might suggest that Pgt uses similar strategies to plants to maintain its highly repetitive genome [41]. The overlap of the late wave Pgt sRNAs with cytosine methylation sites suggests that these sRNAs may function similarly to plant 24-nt siR-NAs to direct methylation to cause transcriptional silencing. The specific expression of one argonaute, one dicer and two RdRPs at the late stage of infection underlines their involvement in such a functionally diversified TE silencing pathway. However, in the absence of a stable transformation system for Pgt, the effect of a loss of the RNAi genes on methylation currently cannot be tested experimentally.
Furthermore, we showed that Hi-C data can be used to define centromeric regions in fungi and uncover the first centromeres in rust fungi. The Pgt centromeres are highly repetitive, hyper-methylated regions with exceptional sequence divergence, unexpectedly even between some haplotypes. Highly repetitive loci such as centromeres can generate sRNAs which in turn are required for epigenetic silencing [42]. Centromeres are essential for chromosome segregation during cell division and heterochromatin is vital to maintain the integrity of the centromeres. Eukaryotic centromere sequences are highly diverse in sequence and can differ even between closely related species [43]. In fungi, their lengths range from point centromeres (<400 bp), short regional centromeres (>400 bp, <20 kb) to large regional centromeres (>20 kb) [44]. For example, the fission yeast S. pombe centromeres span between 35 and 110 kb and resemble those of vertebrates (central core domain of nonrepetitive AT-rich DNA flanked by outer repeats), where the kinetochore is embedded in the heterochromatin of the outer repeats. In Neurospora crassa, centromeres are repetitive, AT-rich 150 to 300 kb long regions [45]. The human fungal pathogen Cryptococcus harbours large regional centromeres that are rich in LTR retrotransposons [16]. The formation of silent heterochromatin in some yeasts depends on siRNAs derived from pericentromeric regions and on the RNAi machinery [12,46]. Genes placed near centromeric chromatin are typically silenced [47,48], with the strongest repression at the outer repeats [49,50]. In the rice blast fungus Magnaporthe oryzae, centromeres span 57-to 109-kb transcriptionally poor regions and share highly AT-rich and heavily methylated DNA sequences [51]. Clearly, centromeres are not well-studied in plant-pathogenic fungi and had thus far not been described in rust fungi. The high activity of Pgt centromeric sRNAs in the later stages of infection might ensure that the genome is passed on stably to subsequent generations through methylation and condensation of centromeres. The TE silencing function can have a silencing effect on nearby genes, and this seems to occur in some Pgt genes that are close to or overlap with methylated TEs. In plants, insertion of TEs near genes can provide cis-elements for stressresponsive or tissue-specific expression, and the expression level can be modulated by DNA methylation and/or histone modification at the TEs due to siRNA targeting. It is likely that a similar DNA methylation or histone modification mechanism exists in Pgt.
In contrast to plants, the roles of sRNAs in epigenetic silencing pathways of fungal plant pathogens have been understudied and previous research has focused heavily on the roles of sRNAs in cross-kingdom gene silencing [52,53]. Several cross-kingdom RNAi interactions between fungal pathogens and plants have been uncovered. Some Botrytis cinerea sRNAs silence Arabidopsis and tomato genes involved in plant immunity and are mainly derived from LTR retrotransposons and are 21 nt in size with a 5′ uracil [54], whilst Arabidopsis cells secrete exosome-like extracellular vesicles to deliver sRNAs into the fungal pathogen Botrytis cinerea to silence pathogenicity genes [55]. A wheat stripe rust fungus Puccinia striiformis f. sp. tritici 20-nt sRNA has been suggested to target the wheat defence pathogenesis-related 2 (PR2) gene [56]. The fungal pathogen Sclerotinia sclerotiorum produces mainly 22-23-nt sRNAs with a 5′ uracil from repeat-rich regions during infection [57]. Whilst Pgt might also use sRNAs to target host genes for silencing, we found strong support for endogenous roles of Pgt sRNAs during infection. Using the ShortStack software which uses criteria tailored to plant miRNA properties, we predicted only a handful of Pgt sRNAs that fulfil the criteria for miRNAs and thus might represent sRNAs involved in gene silencing. However, it is possible that Pgt produces a larger contingent of miRNA-like RNAs that follow currently unknown fungal-specific rules. Loci with some, but insufficient, evidence for miRNA biogenesis (such as strandedness) using the ShortStack software might be worth exploring as miRNA-like candidates in the future [58]. We did not perform target prediction of Pgt sRNAs due to the lack of fungal-specific targeting rules and the high false-positive rate of miRNA target prediction tools [59]. In future studies, sRNA sequencing specifically of haustorial tissues can help to elucidate if haustoria are potentially sites of sRNA transfer between the host and rust fungi [60] and then we can combine target prediction with gene expression data to reduce the number of false-positive predictions.

Conclusions
The wheat stem rust disease caused by Puccinia graminis f. sp. tritici (Pgt) is one of the most devastating crop diseases and of significant global interest. Our work uncovers fundamental characteristics of the stem rust RNAi machinery, DNA methylation in rust fungi and the first characterization of centromeres in rust fungi. We found evidence suggesting an sRNA-directed DNA methylation pathway in rust fungi, with some similarity to the RdDM pathway in plants. Pgt induces waves of early and late infection sRNAs with differing profiles and up-regulates a subclass of RNAi genes during late infection. Future research can use this knowledge to optimize methods of host-induced gene silencing where sRNAs from the plant operate via the fungus's own RNAi machinery to silence pathogen genes important for causing disease.

Hi-C data analysis and centromere identification
Previously published Hi-C data [26] available in NCBI under BioProject PRJNA516922 was analysed using HiC-Pro 2.11.1 [61] and contact maps were plotted with HiCExplorer's hicPlotMatrix [62] to identify centromeric regions.
Chromosomes and centromeric regions were aligned using DGenies [63] and regions of macro-synteny were extracted from the minimap2 [64] paf alignment produced by DGenies. Pairwise k-mer distance estimations were calculated using Mash 2.2.0 with the function mash triangle [65] and clustered as a dendogram (hclust with the method ward.D2).

Gene expression analysis and repetitive element annotation
Previously published RNA-seq data (0 dpi, 2 dpi, 3 dpi, 4 dpi, 5 dpi, 6 dpi, 7dpi) was used for the gene expression analysis [28]. This was complemented with previously published RNA-sequencing data of Pgt 21-0 germinated spores and haustorial tissue [27]. We used Salmon 1.1.0 to align reads to the Pgt 21-0 transcripts [26] and to estimate transcript abundances in each sample (salmon index -keepDuplicates and salmon quant -validateMappings). We used tximport and DESeq2 to assess gene differential expression [66,67]. Differentially expressed genes were annotated with the B2GO software and GO term enrichment analyses were performed with B2GO and the category molecular function [68].
Transcription levels on the chromosomes were obtained by aligning the RNA-seq reads to the Pgt chromosomes [26] with HISAT2 2.1.0 and default parameters [69]. Bedtools 2.28.0 was used to slice the chromosomes into windows (bedtools makewindows) and the aligned reads per genomic window were counted (bedtools coveragecounts) and normalized to reads per million.
Repeat regions were annotated as described previously [70,71] using the REPET pipeline v2.5 [31,32,72] for repeat annotation in combination with Repbase v21.05 [73]. For de novo identification, we predicted repeats for both haplotypes independently using TEdenovo. We combined the resulting de novo repeat libraries without removing redundancies. We annotated both haplotypes with the combined TEdenovo repeat library and two additional repeat libraries from Rebase (repbase2005_ aaSeq and repbase2005_ntSeq). We generated superfamily classifications as described previously [70,71].

Methylation sequencing and analysis
To prepare material from germinated Pgt spores, freshly harvested spores (450-500 mg) were sprinkled on top of autoclaved Milli-Q (MQ) water in a glass baking tray and were incubated at 100% humidity at 20°C in dark for 18 h before harvesting. Once germination was assessed and verified by bright field microscopy, the layer of germinated spores was collected using a glass slide and the remaining moisture was removed as much as possible with a paper towel. Dried samples were snap frozen in liquid nitrogen and stored at −80°C until processed for DNA extraction. High molecular weight DNA was extracted from germinated spores following the phenol:chloroform method with minor modifications. Briefly, the germinated spores were ground with liquid nitrogen into fine powdered material, suspended in lysis buffer followed by cholorform:isoamyl alcohol (24:1) steps twice and incubated in 200 μl of 10 mM Tris pH 8 and 200 μl of Tris-EDTA buffer (TE) at room temperature overnight. For secondary cleanup, DNA bound to Sera-Mag TM SpeedBead magnetic carboxylatemodified particles (GE Healthcare), washed 3 times with ethanol and eluted with 10 mM Tris-HCl pH 8. The DNA was size selected with a Short Read Eliminator (SRE) XS 10 kb (Circulomics) according to dx.doi.org/1 0.17504/protocols.io.betdjei6.
Triticum aestivum cultivar Rangcoo seeds were sown and stratified at 4°C with no light for approximately 48 h. To germinate, the pots were transferred to a growth cabinet set at 21°C with 60-70% relative humidity and a 16-h light cycle. Six days after sowing (seedling approximately 6 cm tall), plant growth inhibitor maleic hydrazide was added, 20 ml per pot at a concentration of 1.1 g/l. Infection of wheat seedlings was performed 7 days after sowing. Three hundred to 400 mg of dormant Pgt urediniospores were heat-shock activated for approximately 3 min at 42-45°C. The urediniospores were then suspended in Novec TM 7100 solvent (3M TM ) and immediately used to inoculate the wheat seedlings, applying the suspension across leaves homogeneously using a flat paintbrush. Pots were then placed into a plastic container, leaves sprayed with Milli-Q water (Merck), sealed with a lid and placed in a secure transparent plastic bag. The bag was transferred to a growth cabinet set at 23°C with 60-70% relative humidity and a 16-h light cycle. After 48 h, the plants were removed from the bag. Infected leaves (day 7) were ground into a fine powder in a mortar and pestle using liquid nitrogen and lysed with the lysis buffer followed by DNA binding, ethanol wash and elution. Notable changes include the addition of 6 mM EGTA to the lysis buffer. DNA was further purified which included RNA/proteins removal, cleanup with cholorform:isoamyl alcohol (24:1), shearing with 5 passes through a 29 gauge needle and size selected with a Short Read Eliminator (SRE) XS 10 kb (Circulomics) according to dx.doi.org/10.17504/protocols.io.betdjei6.
To perform native DNA sequencing, Oxford Nanopore Technologies (ONT) portable MinION Mk1B was adopted. Native DNA sequencing libraries were constructed according to the manufacturer's protocol 1D genomic DNA by ligation (SQK-LSK109), using 3 μg of DNA input. Briefly, DNA was repaired (FFPE DNA Repair Mix, New England BioLabs® (NEB)), end-prepped with an adenosine overhang (Ultra II end repair/dA-tailing module, NEB), purified (AMPure XP, Beckman Coulter) and an ONT adapter was ligated each end (Quick T4 Ligation Module, NEB). Following, the library was cleaned once more and quantified using a Qubit Fluorometer (Thermo Fisher Scientific). A MinION FLO-MIN106 9.4.1 revD flow cell was primed, approximately 300 ng of library was loaded and sequenced according to the manufacturer's instructions (ONT). When the majority of pores became inactive (approximately 24 h), the flow cell was treated with DNase I and another 300 ng of library was loaded, according to the ONT nuclease flush protocol. The nuclease flush protocol was performed 2-3 times, until the flow cell was expended.

Bioinformatic processing and methylation calling
Raw fast5 reads were basecalled with Guppy version 4.4.1+1c81d62 (ONT), using the --fast5_out option to store fastq calls within fast5 files. Sequencing output and quality were inspected with the NanoPack tool NanoPlot version 1.28.2 [74]. We mapped all reads from germinated spores and infected wheat leaf samples against the Pgt 21-0 genome [26] using minimap2 version 2.17-r941 [64] evoking the nanopore flag (-map-ont). We extracted reads that mapped to the Pgt 21-0 genome for downstream analysis and obtained 9.2 Gb and 8.9 Gb aligned raw reads for germinated spores and infected wheat leaf samples, respectively. This gave rise to an average genome coverage of 48x and 35x mapped sequence (cigar), respectively, for downstream methylation calling with Nanopolish and Tombo. De novo identification of the DNA modifications 5mC and 6mA was performed using Tombo version 1.5.1 [75]. We called 6mA and 5mC methylation with Tombo 1.5.1 following standard workflows and described in our github repository https:// github.com/Team-Schwessinger/Pgt210Methylation. We converted the resulting Bigwig files into Bed6 files by calling sites as methylated that had per site methylation frequency above 0.5. Di-and trinucleotide frequencies were calculated with compseq from EMBOSS 6.6.0 [76]. We called CG methylation using nanopolish 0.12.3 and pycometh v0.4.2 (https://doi.org/10.5281/zenodo. 3629254) with minimum per site read coverage of two. We aggregated the methylation frequency in 500, 1000 and 5000 base windows and on CpG sites using pycometh. These files were used for downstream CG methylation analysis. Detailed analysis instructions for data processing and methylation calling can be found at h t t p s : / / g i t h u b . c o m / T e a m -S c h w e s s i n g e r / P g t 2 1 0Methylation. Transposable elements and genes were called as methylated if at least two methylation sites mapped to them.
Small RNA sequencing, read processing, filtering and alignment Small RNA sequencing data was obtained from the same infected leaf samples as the previously published RNAseq data [28]. For rust infection, host plants (cv. Sonora) were grown at high density (~25 seeds per 12-cm pot with compost as growth media) to the two leaf stage (~7 days) in a growth cabinet set at 18-23°C temperature and 16 h light. Spores (−80°C stock) were first thawed and heated to 42°C for 3 min, mixed with talcum powder and dusted over the plants. Pots were placed in a moist chamber for 24 h and then transferred back to the growth cabinet. Leaf samples were harvested at specified days after inoculation, snap frozen and stored at −80°C until use. One hundred milligrammes of freshly collected spores was germinated overnight in four 15-cm petri dishes, each containing 200ml sterile RO water. Germinated spores were harvested via filtering through nylon mesh 15 μm. Small RNAs were extracted from the germinated spores and infected leaf samples with the Purelink microRNA Isolation Kit from Invitrogen. We sequenced sRNAs (50-bp reads) from the following five conditions (3 replicates each) on the Illumina HiSeq: germinated spores, uninfected wheat and infected wheat at 3 dpi, 5 dpi and 7 dpi. Adapters were trimmed using cutadapt (-m18 -M28 -q30 -trim-n -discard-untrimmed) [77]. Untrimmed reads, reads shorter than 18 nts or reads larger than 28 nts were discarded and flanking N bases were removed from each read [77]. FASTQC was run on the resulting reads (http://www. bioinformatics.babraham.ac.uk/projects/fastqc/).
To eliminate reads derived from non-small RNAs, we first generated a database set of potential contaminating RNA sources. Triticum aestivum and Puccinia tRNAs, rRNAs and spliceosomal RNAs were collected from the RNACentral database [78] as well as the tRNA and rRNA RFAM families RF00001, RF00002, RF00005, RF01852, RF01960 and RF02543 [79], snoRNAs from dbsnOPY, 5S and 23S ribosomal RNAs from the European Nucleotide Archive (ENA) and the tRNA/rRNA file from the sRNA workbench [80]. This set of potential contaminant sequences was de-duplicated using bbmap and its tool dedupe.sh (sourceforge.net/projects/bbmap/ ). Reads that mapped to this set were removed using bowtie 1.1.2 [81]. To assess read length distributions across the different samples, clean small RNA reads were mapped to the wheat genome IWGSC RefSeq v1.0 [82] and PGT 21-0 genome [26] using bowtie 1.1.2 (alignment settings: no mismatches allowed -v0; report all alignments: -a -best -strata).
Pgt sRNA prediction, differential expression analysis and allelic sRNA prediction To annotate and quantify high-confidence Pgt and wheat small RNAs from the sequencing data, we used the ShortStack 3.8.5 software [34] on the clean sRNA reads (--bowtie_m all -foldsize 1000). ShortStack predicts and quantifies sRNA-producing loci in a genome based on clusters of sRNA reads and miRNA-producing loci according to a series of tests, such as strandedness of the locus and the predicted precursor secondary structure. We further filtered the predicted sRNA clusters to include only those where ≥ 80% of reads are within 20-24 nts of length (recommended procedure in ShortStack to avoid degradation products) and where the cluster has ≥ 5 reads per million. The ShortStack software outputs sRNA cluster properties such as the most abundant sRNA (termed sRNA candidate) in the cluster, strandedness of the locus, miRNA annotation and phasing [34]. Strandedness of sRNA loci is determined by forcing the bowtie aligner to select one strand or the other with a probability that is proportional to the number of best sites on the strand. Stranded loci are typical of miRNA production in plants and are a requirement for annotation of a locus as a miRNA by ShortStack. We used the read counts returned by ShortStack for all predicted sRNA clusters and used edgeR [83] to assess which are differentially expressed at any of the infection stages versus germinated spores (FDR < 0.05, fold change > 2; calcNormFactors(method=none)).
To assess if sRNAs have a homologous counterpart, we re-mapped the sequencing reads that define an sRNA locus to the remainder of the genome using bowtie 1.1.2 (alignment settings: two mismatches allowed -v2; report all alignments: -a -best -strata). If more than 25% of bases in an sRNA locus are covered by those mapped reads (using bedtools coverage version 2.28.0), it is marked as a candidate homolog. The sRNA locus with the highest coverage amongst the candidate homologs is returned as the predicted allelic counterpart. Circos 0.69.5 [84] was used to plot the links between homologous sRNAs across the chromosomes.
To assess the relationships of sRNAs and TEs, we remapped sRNAs to the genome using bowtie 1.1.2 (alignment settings: no mismatches allowed -v0; report all alignments: -a -best -strata). We reported repeats that overlap with those mapped sRNAs using bedtools intersect [85]. We then retrieved the genes that overlap with repeats using bedtools closest.

Northern blots
Total RNA was extracted from germinated spores, infected tissues and uninfected wheat plants using TRIzol® Reagent (Ambion® USA) according to the manufacturer's instructions. Five microgrammes of total RNA from germinated spores and 10 μg of total RNA from infected or uninfected tissues were separated in 17% denaturing acrylamide gel, electroblotted and UV crosslinked onto HyBond-N+ membrane (GE Healthcare). The filters were hybridized with 32P-labelled antisense oligonucleotides against miRL-1 or miRL-2. To determine the exact size of sRNA, the filters were stripped and re-hybridized with antisense oligonucleotide probe against the wheat endogenous miR168, known to be 21 nt in size. As a loading control for Pgt-derived RNA, the same filters were stripped and hybridized again with an antisense oligonucleotide probe specific for the Pgt U6 RNA. Sequences of the oligonucleotide probes are as follows: For miRL-1: 5′-ACCACATGACTAACGCTACCCT-3′; for miRL-2: 5′-TATGTCCTTCTTTTCATCAACA-3′; for wheat miR168: was 5′-TTCCCGACCTGCACCAAG CGA-3′; the probe sequence for detecting Pgt U6 was 5′-TCTTCACCCGTAGGTGAATCCATTCTGACTAC AT-3′.
Additional file 1: Fig. S1. Hi-C contact maps for the 18 chromosomes of haplotype A show the presence of centromeres in each chromosome.
Additional file 2: Fig. S2. Hi-C contact maps for the 18 chromosomes of haplotype B show the presence of centromeres in each chromosome.
Additional file 3: Fig. S3. 150 kbp bins with interaction frequency > 5 in the Hi-C interaction matrix are shown between chromosomes 1A, 2A, 3A and 4A. The putative centromeric regions share strong connections with each other. Densities of expressed genes and coverage of repetitive