Capturing the global diversity of the Pst population
To reduce the complexity of the genomic data generated from Pst-infected wheat samples, we aimed to define regions of the Pst genome showing high variability between pathogen strains that could be subsequently amplified for sequencing on the MinION platform from Oxford Nanopore Technologies. The first step was to capture the diversity of the Pst global population. To achieve this, we carried out transcriptome sequencing on 100 Pst-infected wheat samples collected between 2015 and 2017 from nine countries, including those in eastern and southern Africa, Europe, North America and Asia (Additional file 1: Table S1). Total RNA was extracted from each sample and subjected to RNA sequencing (RNA-seq) analysis using the Illumina HiSeq platform and our previously described field pathogenomics strategy [11]. To maximise the geographical distribution of Pst isolates, we combined these 100 RNA-seq datasets with previously published genomic and transcriptomic datasets from a further 201 Pst strains spanning a total of 19 countries, including Chile, New Zealand, Pakistan and an array of European countries [11, 12] (Additional file 1: Table S1). Raw reads were filtered for quality, and data from each Pst sample were independently aligned to the Pst race PST-130 reference genome [13]. An average of 37.3% (± 18.2%, S.D.) reads aligned to the reference genome for the combined RNA-seq datasets, and 82.7% (± 4.9%, S.D.) reads aligned for the genomic datasets [11] (Additional file 1: Tables S2 and S3). Overall, the data from this global collection of Pst isolates comprised 280 transcriptomic and 21 genomic datasets from Pst isolates spanning 24 countries that could then be used for subsequent population genetic analysis.
To determine the genetic relationships between these 301 global Pst samples, we carried out phylogenetic analysis using the third codon position of 2034 PST-130 gene models (589,519 sites) using a maximum-likelihood model (Additional files 2 and 3). Pst isolates tended to cluster based on their geographical origin, with only four of the 14 divisions containing Pst isolates that spanned continental boundaries (Fig. 1a). These four clades included (i) clade 2 with Pst isolates from China and the USA, (ii) clade 9 with Pst isolates from Europe and South Africa, (iii) clade 10 containing Pst isolates from Ethiopia and New Zealand and (iv) clade 14 containing isolates from Europe and New Zealand (Fig. 1a). This represents relatively recent shared ancestry between the populations within these four clades, which could be indicative of long-distance transmission of Pst strains either between these regions or from a common independent source area.
We carried out multivariate discriminant analysis of principal components (DAPC) to further define subdivisions within the global Pst population. First, we generated a list of 135,372 synonymous single nucleotide polymorphisms (SNPs), of which 135,139 were biallelic in at least one Pst sample and were therefore used for DAPC analysis. Assessment of the Bayesian Information Criterion (BIC) supported division of the Pst isolates into five groups of genetically related Pst isolates (Additional file 4). However, due to the high level of diversity within the global Pst population, this initial DAPC analysis was able to separate only Pst populations with high levels of genetic differentiation and was unable to resolve lower levels of within-group variation [14] (Fig. 1b). For instance, group 1 (C1) contained Pst isolates from Pakistan, Ethiopia, Europe and New Zealand, and group 2 (C2) contained Pst isolates from China and two European races that have been shown to be genetically distinct in previous population studies [12, 15]. Therefore, we performed further DAPC analysis on each of the five population groups independently and, following analysis of the BIC, Pst isolates were separated into clear subsets of homogenous groups of individuals that better reflected the phylogenetic clustering (Fig. 1b; Additional file 4). Overall, this analysis indicated that the global Pst population is highly diverse and, with only a few exceptions, consists of geographically isolated groups of distinct homogenous individuals.
A subset of genes can be used to capture the global diversity of Pst isolates
To identify specific Pst genes contributing to the separation of isolates into distinct groups in the population genetic analysis, we used comparative analysis to find the most variable genes among the 301 global Pst isolates that were conserved across all Pst isolates analysed. First, we calculated the number of SNPs per kilobase for each gene from alignments of sequences representing the 301 Pst isolates against the PST-130 reference genome [13]. SNPs per kilobase values were calculated by normalising the total number of SNPs found in the coding sequence of each gene across the 301 Pst isolates relative to the length of the coding sequence for each gene. A total of 1690 genes were identified as polymorphic (SNPs/kb ≥ 0.001) between Pst isolates and subsequently utilised for phylogenetic analysis with a maximum-likelihood model. Importantly, the sequences from these 1690 polymorphic genes were sufficient to reconstruct the topology of the global Pst phylogeny (Additional file 5).
To determine the minimum number of gene sequences required to accurately reconstruct the global phylogeny, we ordered the 1690 genes based on the number of polymorphic sites across the 301 Pst isolates (Fig. 2a). We then selected 1006, 748, 500, 402, 301, 204, 151 and 100 of the most polymorphic genes using progressively increasing cut-off values for SNPs per kilobase (0.006, 0.0105, 0.018, 0.023, 0.0285, 0.036, 0.042 and 0.051, respectively) and carried out phylogenetic analysis as described above with each of these subsets (Additional file 5). We noted that a single Pst isolate from clade 9 was mis-assigned to clade 4 in the phylogenies reconstructed from fewer than 500 genes (Additional file 5). This inconsistency was likely due to poor gene coverage for this Pst isolate when the data were aligned to the PST-130 reference genome; for instance, 96.5% of bases had less than 20× coverage when using 402 Pst genes to reconstruct the phylogeny. Therefore, this Pst isolate (14.0115) was excluded from the general evaluation. Overall, we concluded that whilst minor changes in clade ordering were observed when using sequence data from less than 500 genes, sequence data from as few as 100 genes were sufficient to generate a similar phylogeny topology (Additional file 5) and assign Pst isolates to the 14 previously defined groups.
The next step was to use the minimal number of polymorphic genes required to represent Pst population diversity to define a subset of genes for PCR amplification in preparation for sequencing on the MinION platform. We reasoned that sequencing a small subset of highly variable genes would reduce the volume of data generated and associated cost per sample, whilst maintaining our ability to define individual strains. We selected the 500 most polymorphic genes between Pst isolates and within this subset randomly selected 250 of these genes; oligonucleotides were successfully designed for 242 genes (Additional file 1: Table S4). Given that a minimum of 100 genes was sufficient to accurately assign Pst isolates, the additional 142 genes were included to ensure that Pst isolates could be correctly assigned even if a large proportion (up to 58%) of the genes failed to amplify under field conditions. To validate that the 242 polymorphic genes were not biased in their selection by a high degree of divergence from the reference isolate PST-130 for any particular group of individuals, we assessed the total number of SNPs across these 242 genes for Pst isolates belonging to each of the five major genetic groups identified through DAPC analysis (Fig. 1b). The SNPs were distributed across all the major genetic groups, with the least number of SNPs identified in Pst isolates of genetic group 2 and the greatest number identified in Pst isolates from genetic group 4 (Fig. 2b). The low differentiation of Pst isolates in genetic group 2 from the PST-130 reference isolate likely reflects a close genetic relationship. Finally, we confirmed that the 242 genes selected could be used successfully to reconstruct the global phylogeny and assign Pst isolates to the 14 previously defined groups (Fig. 2c; Additional files 6 and 7). Overall, this analysis illustrated that using sequence data from a minimal set of 242 polymorphic Pst genes was sufficient to accurately genotype Pst isolates and re-construct a comparable phylogeny to that achieved from full-genome or transcriptome sequencing.
Genes selected for amplicon sequencing are distributed across the Pst genome and the majority encode enzymes
To characterise the 242 Pst genes selected for sequencing on the MinION platform, we carried out positional and functional annotation. To assess the distribution of the 242 polymorphic genes across the Pst genome, we identified their genomic locations in the highly contiguous Pst-104 reference genome [16]. For 241 of the 242 genes, near-identical (> 94% pairwise identity) hits in the genome were obtained when gene sequences were mapped to the genome using minimap2 [17]. These 241 genes were distributed across a total of 135 genome scaffolds, with the majority of genes (60%) located on scaffolds that contained only one of the 241 genes (Additional file 1: Table S5). Only 10 scaffolds contained more than five of these genes, suggesting that the majority of the 241 genes were scattered across the genome and not grouped in gene clusters (Fig. 3a). Using gene ontology (GO) term analysis, we found that the majority (64%) of the 242 genes encoded proteins with enzymatic functions (GO: 0003824—catalytic activity; GO: 0005488—binding) and were involved in different metabolic and cellular processes (Fig. 3b; Additional file 1: Table S5). Overall, this analysis indicates that 241 of the 242 Pst genes selected are well distributed across the Pst genome and are enriched for functions in fungal metabolism.
Comparative analysis of the Illumina and Oxford Nanopore sequencing platforms
To assess the suitability of the mobile MinION sequencer for population diversity analysis using the 242 Pst genes selected, we carried out a comparative analysis with data generated on the Illumina MiSeq platform, which is frequently used for this purpose [18]. Four Pst-infected wheat samples were collected in 2017 in Ethiopia (Additional file 1: Table S1). Following genomic DNA extraction, each of the aforementioned 242 Pst genes was amplified from each sample. Each gene was then used for amplicon sequencing on both the MinION and MiSeq platforms. A total of 6.9, 3.6, 6.2 and 6.4 million paired-end Illumina reads and 109, 102, 128 and 113 thousand MinION reads were generated for each of the four Pst-infected wheat samples (17.0504, 17.0505, 17.0506 and 17.0507 respectively). Following base calling and quality filtering, reads were aligned to the gene sequences for the 242 genes from the PST-130 reference [13] (Additional file 1: Table S6 and S7). For each Pst-infected sample, consensus sequences were generated for each of the 242 genes, using data produced on the Illumina MiSeq platform. Each consensus gene set separately incorporated the SNPs identified within the gene space by mapping the reads from each of the four Pst isolates against the gene sequences of the 242 genes. These four sets of sequences formed an accurate baseline for comparison with sequence data generated on the MinION sequencer.
To evaluate the minimum depth of coverage required to obtain similar levels of accuracy on the MinION sequencer, we performed a comparative analysis between the two platforms. Sequence data generated on the MinION platform were used to create consensus sequences for each of the aforementioned 242 Pst genes using varying depths of coverage for each of the four Pst-infected wheat samples. The percentage identity of these consensus sequences was then determined through comparative analysis with the MiSeq baseline consensus sequences. A minimum depth of 20× coverage on the MinION sequencer was sufficient to achieve 98.74% sequence identity between the two datasets (Fig. 4a).
We then investigated whether there was any notable selective bias during library preparation and sequencing of individual genes using either the MiSeq or MinION platform. We determined the percentage coverage for each of the 242 genes sequenced for the four Pst isolates on the two sequencing platforms. The average coverage per gene for the MiSeq (0.41 ± 0.02, S.E.) and MinION (0.41 ± 0.03, S.E.) platforms was comparable (Fig. 4b).
Using the predefined 20× coverage level, we evaluated the required run time to achieve this level of coverage across all 242 selected Pst genes on the MinION platform. Assuming equal coverage of all genes, we determined that to reach 20× coverage for all 242 genes in each of the four samples (4840 reads) would take less than 30 min from starting the MinION sequencing run [18.75 (17.0504), 21.77 (17.0505), 17.65 (17.0506) and 19.20 (17.0507) minutes] (Additional file 1: Table S8).
Finally, using the minimum level of 20× depth of coverage for data generated on the MinION sequencer, we defined the number of SNPs per gene in each of the four MinION datasets. This was then compared with SNP analysis using sequence data generated on the MiSeq platform. SNP profiles for each of the samples sequenced on the MinION and the MiSeq platforms were largely comparable, with the general trend being that more SNPs (compared with the reference) were identified when sequencing was carried out on the MinION platform (Fig. 4c; Additional file 1: Table S9). In particular, we observed that several positions that were designated as being homokaryotic from data generated on the MiSeq platform appeared as heterokaryotic when using the MinION sequencer. The average ratio of heterokaryotic to homokaryotic nucleotide positions using the MiSeq platform was 0.01 (± 0.0002, S.D.), which was 20% higher (0.012 ± 0.0004, S.D.) when the MinION sequencer was used (Additional file 1: Table S10). However, as the overall average sequence identity between samples sequenced using the MiSeq and MinION platforms was > 98%, we concluded that when a minimum of 20× depth of coverage is achieved, the data generated on the MinION sequencer are largely comparable in accuracy to those from the MiSeq platform and therefore should be suitable for population genetic analysis.
Pst isolates from Ethiopia in the 2017/2018 wheat crop season are genetically closely related
To further assess the ability of the MinION-based sequencing platform to accurately define Pst genotypes in field-collected infected samples, we expanded our analysis to a larger sample of 51 Pst-infected wheat samples collected in Ethiopia predominantly during the 2017/2018 growing season (Additional file 1: Table S1). DNA was extracted from each sample independently, and each of the aforementioned 242 Pst genes was amplified and prepared for amplicon sequencing on the MinION platform. In parallel, RNA was extracted and RNA-seq analysis was undertaken using the Illumina HiSeq platform and our field pathogenomics strategy for comparison [11]. An average of 114,519.37 (± 91,448.42, S.D.) reads per library were generated using the MinION sequencer and a total of 23,434,565.49 (± 2,468,438.63, S.D.) reads per library were generated on the HiSeq platform (Additional file 1: Tables S7 and S11). Following base calling and data filtering, reads generated on the HiSeq or MinION platforms were aligned independently for the 51 Pst isolates to sequences of the 242 Pst genes selected.
We then carried out the phylogenetic analysis as described above, using data from either the MinION or HiSeq platforms independently (Fig. 5; Additional files 8, 9, 10, 11 and 12). To compare the Ethiopian Pst isolates with the global Pst population groups, we also included sequence data for the 242 genes from the 301 global Pst isolates in the phylogenetic analysis. The positioning of the 51 Ethiopian samples in the phylogenies was similar between the two datasets, with the 51 Pst field isolates grouping in two closely related clades in both cases (Fig. 5 and Additional file 8). This analysis further supports the conclusion that when a sufficient level of coverage is used, data generated on the MinION platform can be used to accurately define Pst genotypes.
Assigning Pst isolates to known genetic groups defined by SSR marker analysis
To compare the phylogenetic clades with previously defined Pst genetic groups based on simple sequence repeat (SSR) marker analysis and pathogenicity testing [15, 19,20,21], we selected 13 additional Pst isolates of diverse origin representing these groups (Additional file 1: Table S12). DNA was extracted from each sample independently, and the 242 Pst genes were amplified and prepared for sequencing on the MinION platform. Following base calling and quality filtering, reads were aligned to sequences of the 242 PST-130 genes. The resulting data were then combined with those from the 301 global Pst isolates and the 51 Ethiopian Pst isolates collected predominantly during the 2017/2018 field season, and phylogenetic analysis was performed (Fig. 5; Additional files 9 and 10).
The 13 Pst isolates representing previously defined Pst groups and races clustered in the phylogeny as follows. US312/14 (a.k.a AR13–06), representing a new group of isolates in North America carrying virulence to the yellow rust (Yr) resistance gene Yr17, grouped in a clade with other recent Pst isolates that were collected in the USA and Canada in 2015 and 2016. AZ160/16 and AZ165/16 belonging to the PstS2, v27 group, which has been prevalent in eastern and northern Africa and western Asia, grouped with Pst isolates from Ethiopia. UZ180/13 and UZ14/10, both representing the PstS5, v17 group prevalent in central Asia, was basal to a clade of Ethiopian Pst isolates. UZ189/16 (PstS9, v17), frequently found in central Asia, formed a distinct branch in the phylogeny. ET08/10, representative of the PstS6 group and carrying virulence to Yr27, formed a long unique branch. SE225/15, which belongs to the PstS4 race (a.k.a. ‘Triticale2006’) and is frequently found on triticale in Europe, formed a distinct branch close to Pst isolates from Ethiopia. KE86058, a representative of the PstS1 aggressive strain recovered from the ‘Stubbs Collection’, grouped with isolates from Ethiopia. DK14/16 and SE427/17 representing the ‘Warrior’ PstS7 group, DK52/16 representing the ‘Kranich’ PstS8 group and DK15/16 representing the ‘Warrior(-)’ PstS10 group, shown to be analogous to ‘genetic group 1’, ‘genetic group 5-1’ and ‘genetic group 4’, respectively [12], clustered accordingly in the phylogeny (Fig. 5). This result illustrates that data generated on the MinION platform for the 242 polymorphic Pst genes can be used to accurately distinguish the genetic groups previously defined from SSR marker-based classification, providing additional support to the methodology herein. Furthermore, the inclusion of these reference Pst isolates in future analysis will enable isolates of similar genetic background to be rapidly identified.
In-field MinION-based diagnostics can define Pst isolates in Ethiopia in real-time
As resource-poor locations frequently bear the brunt of plant disease epidemics, we developed a simplistic Mobile And Real-time PLant disEase (MARPLE) diagnostics pipeline so that the 242 polymorphic Pst genes could be amplified and sequenced on the MinION sequencer for phylogenetic analysis in situ (Fig. 6; Additional file 13). To test our MARPLE diagnostics pipeline, we collected four Pst-infected wheat samples in 2018 and carried out analysis in situ in Ethiopia (Additional file 1: Table S1). As Ethiopia may act as a gateway for new Pst isolates entering Africa from sexually recombining populations in Asia, this pipeline would enable rapid detection of any new Pst strains entering East Africa.
First, DNA was extracted from each Pst-infected wheat sample using a simplified method wherein Pst-infected plant tissue was homogenised, cells lysed and DNA isolated using magnetic-bead-based purification (Fig. 6). Next, a panel of 242 oligonucleotide pairs was used in PCR amplification to enrich for the previously defined set of Pst gene sequences. This enrichment enabled direct analysis of each of the field-collected Pst-infected wheat tissue samples. The 242 oligonucleotide pairs were pooled into four groups, where concentrations were optimised individually to amplify all genes in each individual group (Additional file 1: Table S4). To ensure ease of portability and avoid the need for continuous electricity, PCR amplification was performed using thermostable Taq polymerase and a battery-powered mobile miniPCR machine (Fig. 6; Additional file 1: Table S13). Finally, a simple analysis pipeline independent of internet connectivity was utilised on a laptop computer for phylogenetic analysis of Pst isolates (Fig. 6).
Overall, the entire pipeline from sample collection to completion of the phylogenetic analysis was achieved within 2 days, providing rapid real-time information on the population dynamics of Pst in Ethiopia. The resulting phylogenetic analysis of the four Pst-infected wheat samples illustrated that the late 2018 Pst population in Ethiopia was similar to that defined in the previous 2017/2018 growing season (Fig. 5).