Genomic evidence for recurrent genetic admixture during domestication mediterranean olive trees (Olea europaea)

Background The olive tree (Olea europaea L. subsp. europaea, Oleaceae) has been the most economic perennial crop for Mediterranean countries since its domestication around 6,000 years ago. Two taxonomic varieties are currently recognized: cultivated (var. europaea) and wild (var. sylvestris) trees. To shed light into the recent evolution and domestication of the olive tree, we sequenced the genomes of twelve individuals: ten var. europaea, one var. sylvestris, and one outgroup taxon (subsp. cuspidata). All of them were analysed together with an improved assembly of var. europaea reference genome and the available assembly of var. sylvestris. Results Our analyses show that cultivated olives exhibit slightly lower levels of overall genetic diversity than wild forms, and that this can be partially explained by the occurrence of a mild population bottleneck 5000-7000 years ago during the primary domestication period. We also provide the first phylogenetic analysis of genome-wide sequences, which supports a continuous process of domestication of the olive tree. This, together with population structure and introgression analyses highlights genetic admixture with wild populations across the Mediterranean Basin in the course of domestication. Conclusions Altogether, our results suggest that a primary domestication area in the eastern Mediterranean basin was followed by numerous secondary events across most countries of southern Europe and northern Africa, often involving genetic admixture with genetically rich wild populations, particularly from the western Mediterranean Basin. Based on selection tests and a search for selective sweeps, we found that genes associated with stress response and developmental processes were positively selected in cultivars. However, we did not find evidence that genes involved in fruit size or oil content were under positive selection.


Background
The Mediterranean olive tree (Olea europaea L. subsp. europaea, Oleaceae) is one of the earliest cultivated fruit trees of the Mediterranean Basin (MB). Current classifications recognize two taxonomic varieties of O. europaea: var. sylvestris (hereafter sylvestris, also named oleaster) for wild populations and var. europaea (hereafter europaea) for cultivated forms (1,2). Both varieties are predominantly out-crossing, and have long lifespans, including a long juvenile phase that can last up to 15 years in natural conditions. The natural distribution of the Mediterranean olive encompasses all countries of the MB, although a few wild populations have also been found in northern areas with low occurrence of frost (3).
Cultivars have historically been planted nearby wild populations since ancient times, where they exchange pollen that have resulted in effective crop production and historical hybridization (4).
There is a large body of evidence pointing to the eastern MB as the cradle of the first domestication event of olives. According to archaeological, palaeobotanical, and genetic studies, the crop was domesticated from eastern wild progenitors around 6,000 years ago (5)(6)(7)(8). Once superior individuals were selected, clonal propagation made their multiplication and the fixation of valuable agronomic traits possible. Clonal propagation also facilitated the spread of cultivars from the eastern to the western MB via classical civilizations such as Phoenicians, Greeks, or Romans. After six millennia, olive domestication has resulted in a vast number of cultivars of uncertain pedigree that are often geographically restricted to local areas (9).
It remains unclear whether olive cultivars derived from a single initial domestication event in the Levant, followed by secondary diversification (8,10,11), or whether cultivated lineages are the result of more than a single, independent primary domestication event (7,(12)(13)(14)(15).
Previous studies based on plastid and nuclear markers have suggested controversial but not necessarily incompatible domestication scenarios. The genealogical reconstruction of plastid lineages have yielded unresolved phylogenies at the base of domesticated lineages due to low plastid diversity (8,16,17). For instance, 90% of olive cultivars across the MB share the same "eastern like" plastid haplotype (17). This general result is congruent with archaeological data (5), which suggests a major domestication event in the Levant, possibly followed by recurrent admixture events with local wild olives that would have contributed to the crop diversity. However, nuclear markers showed a more complex pattern. Olive cultivars clustered together into three different gene pools, with a rough geographical correspondence to the eastern, central and western MB (12,(18)(19)(20)(21). The relationship between these groups also showed interesting features (12): the western group (southern Spain and Portugal) retained the fingerprint of a genetic bottleneck, and surprisingly, was closely related to cultivated accessions from the Levant (12). The central MB group, which also included the cultivars from eastern Spain (Catalonia, Valencia and Balearic Islands), showed signals of recent and extensive admixture with local wild populations and relatively high plastid diversity compared to the other groups of cultivars. These differential patterns, along with the fact that many cultivars from the central MB retain wild-like phenotypic characteristics, opened the controversial question of a possible minor domestication center for olives in the central MB (10,14).
Approximate Bayesian Computing (ABC) models were applied to infer the demographic history of olives and were consistent with a primary domestication event in the East (8,12).
These models also highlighted the paramount role of admixture to account for the diversity of the crop. This feature was particularly predominant in cultivars from the central MB, wherẽ 20% of the genetic diversity of olives may have been acquired via introgression with local wild populations (12). So far, genetic and archeological sources of evidence have agreed with the existence of a major center of domestication for olives in the Levant Table 1). The ten cultivars were carefully selected to represent: (i) the main cultivar diversity from the most important areas of Mediterranean olive cultivation.
For instance, 'Arbequina' is the most international cultivar due to its adaptation to high density planting designs, 'Picual' covers more than 1. 'Chemlal de Kabilye' -E3.2; the rest of cultivars -E1.1). For sylvestris, previous phylogenetic results and field experience led us to choose a wild individual from an isolated area in order to avoid potential feral or highly introgressed trees. Hence, we sampled a tree from a location by the coast near the Cantabrian mountains (northern Spain) where the olive tree has not been historically cultivated and that is 200 km distant from current plantations.
These populations had previously been screened using fingerprinting techniques (3) and Sanger sequencing (17).
The twelve newly obtained sequences complement the two available genomes for the species: the cultivar Farga from eastern Spain, for which we provide here an improved assembly by anchoring it to a genetic map, and an oleaster (var. sylvestris) from Turkey (22,23). The analysis of these genomes is meant to shed light on the recent evolution and domestication of the olive tree. In particular, we addressed the question whether genome wide variation data can disentangle scenarios of one versus multiple centers of domestication. Additionally, we were interested in finding out whether genetic introgression between wild and cultivated trees have historically played a role in the domestication process, to assess the presence of potential domestication bottlenecks and to identify genes and genomic regions under selection. Finally, sequencing of nuclear genomes allowed testing earlier suggestions of close relationship between cultivars of distant locations such as southern Spain and the Levant (12).

New assembly version of the reference olive (cultivar Farga) genome
We improved the available O. europaea var. europaea (cultivar Farga) genome assembly (Oe6 version) (22) by anchoring it to chromosomes using a publicly available genetic map (24) and removing 201 scaffolds which likely represent contaminating sequences (See materials and methods). In the final assembly (Oe9), 520.5 Mbp (39.5%) of Oe6 sequence was anchored to 23 linkage groups, of which 288 Mbp (21.8%) were oriented. This anchored assembly (Oe9) has a lower N50 as compared to the recently published assembly for O. europaea var. sylvestris (23), but it is much less fragmented, displaying a ~5 fold reduction in the number of scaffolds (9,751 vs 41,261 scaffolds, Additional file 1: Fig. S1 and Additional file 2: Table S1). The comparison between both assemblies (Additional file 1: Fig. S1b) shows a high level of conserved synteny and confirms the ancient polyploidization event in O. europaea described earlier (25) as many regions appear duplicated between the two genomes.
Additionally, we improved the genome annotation in Oe9 genome, by extended automated functional annotation and by manual curation of some of the genes. Based on this, the Oe9 assembly has 4,911 more annotated genes than the var. sylvestris genome. A comparison of gene sets using BLASTN (26) with identity >80% and e-value<1e-5 cutoffs, shows that 5,245 Oe9 annotated genes do not have a match in sylvestris, and, conversely, 2,620 genes of sylvestris do not have a match in europaea. Distinct genome annotation methods could partly explain these differences (22,23). To have an annotation-independent measure of differences between both assemblies, we mapped raw reads of both genome projects to the alternative assembly and assessed coverage of the putative unique genes (see material and methods). We first filtered out the genes that have at least 50% of their length with a read coverage higher than 20, which resulted in 2,115 and 280 unique genes for europaea and sylvestris, respectively. Even when lowering the coverage threshold to 5, europaea still had more unique genes (1,756) than sylvestris (102). Of these, we discarded 131 Oe9 specific genes as possible contaminations as their first BLAST hit fell outside plants. Thus, some of the genes uniquely found in the Oe9 assembly may represent true differences in terms of gene content.
Interestingly, some unique genes found in europaea have functions associated with stress response, such as HIPPs (heavy metal-associated isoprenylated plant proteins) (27), LEA (Late Embryogenesis Abundant) (28), and salicylic acid-binding (29). Other genes are associated with growth and development. This is the case of RALF (Rapid ALkalinization Factor), which has been shown to arrest root growth and development in tomato and Arabidopsis (30), and caffeoyl shikimate esterase (CSE), which is an enzyme central to the lignin biosynthetic pathway (31). It is well known that lignin biosynthesis contributes to plant growth, tissue and organ development, and response to a variety of biotic and abiotic stresses (32). Some other Oe9 unique genes were associated with seed dormancy and sugar signaling, DOG1 (33), and positive regulation of germination, PELPK1 (34) (See Additional file 2: Table S2). Only two sylvestris unique genes had annotated functions (Additional file 2: Table   S2). One corresponds to GSH-induced LITAF, which negatively regulates hypersensitive cell death in Arabidopsis (35). The other one corresponds to FAR1 (far-red-impaired response) related sequence, with roles in diverse developmental and physiological processes (36,37).
In addition, we assembled the plastid and mitochondrial genomes of the cultivar Farga, which were not provided as separate assemblies in the previous release (22) (see materials and methods). The final assembly of the plastid genome comprised 155,658 base pairs (bp) (Additional file 2: Table S3), in agreement with previously reported olive plastid sequences, which range from 155,531 to 155,896 bp (16,38,39). We annotated 130 genes out of the 130-file 2: Table S3) (16,38), of which 85 are protein coding genes, 37 are transfer RNAs, and   eight are ribosomal RNAs. The final assembly of the mitochondrial genome has a size of   755,572 bp (Additional file 2: Table S3), which is similar to that of previously sequenced wild olive mitochondrial genomes (710,737 -769,995 bp) (39). The coding regions in the olive mitochondrion comprise 46 protein-coding genes, 3 ribosomal RNA genes, and 26 transfer RNA genes (Additional file 1: Fig. S3, Additional file 2: Table S3).

Contrasting genetic diversity patterns in organellar and nuclear genomes
We used this improved reference genome assembly (Oe9) to call SNPs of all individuals at the nuclear, plastid, and mitochondrial genomes. Altogether, for wild and cultivated olives (subsp. europaea), we obtained a total of 16,604,110 polymorphic positions uniformly distributed along the nuclear genome (Additional file 1: Fig. S4), 81 in the plastid genome, and 2,465 in the mitochondrial genome (see Additional file 1: Fig. S2, S3). In the plastid genome, a large region (~25 Kb) was found to be fully conserved and devoid of SNPs in all analyzed individuals (Additional file 1: Fig. S2). Interestingly this region includes the largest plastid gene, ycf2, which has also been found to exhibit low rates of nucleotide substitution in other plants (40). This gene is essential for plant survival, however its exact function is unknown (41,42). This conserved region also comprises other genes, including ycf15, rps7, rps12, ndhB, rRNA and tRNA. All individuals presented similar amounts of nuclear polymorphisms relative to the Oe9 reference (Fig. 1a). Interestingly, the sylvestris from northern Spain (sylvestris-S) has a higher number of homozygous SNPs and a lower number of heterozygous SNPs (Fig. 1a), suggesting a historically small population size. In contrast, the recently published sylvestris from Turkey (sylvestris-T) has a number of homozygous and heterozygous SNPs in the range of variation found in cultivars (Fig. 1a). This observation was also obtained when the SNP calling was performed using sylvestris-T genome as a reference (data not shown).
Strikingly, the patterns of polymorphisms in the organellar genomes did not follow the gradient described above for the nuclear genome (Fig. 1b). In contrast to the nuclear genome, the plastid and mitochondrial genomes of the wild individual sylvestris-S and the cultivar Chemlal de Kabilye from Algeria show a notably lower number of SNPs relative to the Farga reference genome (Fig. 1b). Plastid genomes can be arranged into three groups based on nucleotide polymorphisms, which are congruent with the three main plastid lineages (E1, E2, and E3) already described (Fig. 1c) (16,43). More interestingly, the sylvestris-S shares the same plastid haplotype as cultivars Chemlal de Kabilye and Farga (E3), having only one SNP difference when compared with the latter cultivar (Fig. 1c). Hence, the nuclear and organellar genomes presumably reflect different evolutionary histories. Notably, the organellar genomes of the cultivars Farga, Chemlal de Kabilye, and the wild individual sylvestris-S sequenced in our study show a very close genetic relationship (Fig. 1b,c). As organelles are maternally inherited in olive (44), our results suggest that these cultivars and sylvestris-S share a most recent common maternal ancestor. Similarly, 'Megaritiki' (from Greece) and 'Lechin de Sevilla' (Spain) share the same plastid haplotype (E2), which is different to that of all the other cultivars. Interestingly, this haplotype can also be found in wild populations but exclusively from central and western Mediterranean areas (17).
Similarly, a recent transcriptome-based analysis reported slightly lower genetic diversity in cultivars compared to wild olives, leading to the suggestion of a weak-moderate population bottleneck during domestication (11).
In general, lower genetic diversity in cultivars is commonly associated with genetic bottlenecks during domestication (46). The difference between wild and cultivated trees observed here is less pronounced than that of many domesticated plant species (46,47).
However, perennial crops often do not show evident domestication bottlenecks, in part because vegetative propagation means that perennials may not be many generations removed from their ancestral genetic diversity (48). In order to explore this possibility, we inferred the demographic history of olive using SMC++ v1.15.2, a tool that handles unphased genomes

Population structure and phylogenetic relationships of the olive tree
We reconstructed the phylogenetic relationships of wild and cultivated olives using nuclear, plastid and mitochondrial SNPs separately. In addition, we tested introgression using a model-based genetic structure analysis (see material and methods). Interestingly, the nuclear polymorphisms based phylogeny ( Fig. 2a) places sylvestris-T intermingled within cultivars, and close to 'Beladi' and 'Sorani', which are main cultivars from Lebanon and Syria, respectively. This result remained the same when sylvestris-T was used as reference for the SNP calling (data not shown). Based on this and further results discussed in the following sections, we suggest that sylvestris-T represents a feral individual and may have been misidentified. Considering the results above, this might have contributed to apparent lower differences between wild and cultivated olives in terms of genetic diversity. Genetic structure analyses including all thirteen individuals from subsp. europaea suggested the presence of three distinct ancestral genetic clusters, which are differentially present among the different individuals ( Fig. 2b, Additional file 2: Table S4). Based on this, we distinguished three groups: one composed only by cluster 1 ('Beladi', 'Sorani'), another composed by a mixture of cluster 1 and 2 ('Lechin de Granada', 'Picual'), and the largest consisting of a mixture of the three clusters ('Farga', 'Arbequina', 'Koroneiki', 'Frantoio', 'Lechin de Sevilla', 'Megaritiki', 'Chemlal de Kabilye'). When we analyzed the two oleaster samples, we observed that sylvestris-S is composed only by cluster 3, while sylvestris-T is composed largely by cluster 1, similar to 'Beladi' and 'Sorani', but with a small fraction of cluster 2.
Interestingly all cultivars shared cluster 1, which is pervasive and enriched in cultivars from the eastern MB in different proportions, suggesting that this may be a consistent fingerprint of the primary domestication event that took place in this area (17). The dominance of this genetic background among cultivars suggests that the cultivars derive from a common primary domestication process. However, the presence of additional gene pools within the cultivars depicts patterns that could have resulted from preferential selection of genetic variants among standing variation, from separate domestication events, or from introgression events with wild populations.
To reconstruct the plastid tree, we included the available plastid genomes of four additional accessions of O. europaea subsp. cuspidata and one individual of other subspecies that are publicly available (see material and methods). In this phylogeny (Fig. 2d) cuspidata individuals grouped together in congruence with previous results (16,43). The topology of the phylogenetic tree does not change if we use other plastid reference genomes such as Manzanilla (NCBI:FN996972), 'Bianchera' (NCBI: NC_013707), or 'Oeiras 1' (NCBI: MG255763) for the SNP calling (data not shown). Remarkably, consistent with the polymorphism patterns discussed earlier, we observed incongruencies between the organellar and nuclear trees (Additional file 1: Fig. S6), which gives support for a phylogenetic signal of hybridization (50)(51)(52). In particular, all cultivars (plus the possible feral sylvestris-T) are monophyletic in the nuclear tree (being sylvestris-S the earliest branching), whereas the plastid tree shows that cultivars are polyphyletic, falling into three independent clades, each of them associated with different sylvestris samples. Moreover, attending to the position of samples from other subspecies (such as guanchica, marrocana, and laperrinei), which are intermingled within the subspecies europaea, the plastid genome suggest a polyphyly of the subspecies europaea.
A particularly strong phylogenetic incongruence observed among cultivars involves the cultivars Farga and Chemlal de Kabilye, which cluster together with the other cultivated olives in the nuclear tree, but are sister to var. sylvestris from Spain, far from the other cultivars in both plastid and mitochondrial trees (Fig. 2d,e). This result supports previous evaluation of maternal inheritance of plastid and mitochondria (21). Particularly, these results suggest that the maternal line of cultivars Farga and Chemlal de Kabilye originated from western Mediterranean wild olives (which carry the E3 plastid haplotype), while the paternal line originated mostly from domesticated individuals from the eastern Mediterranean basin. A similar pattern has been observed in a previous study that combined a large sample of cultivated olives and oleasters, in which most cultivars were assigned to the eastern genetic pool, even those with maternal lineages that originated from the western Mediterranean basin (11,17). All these results reinforce the idea that cultivars are either from the eastern genetic pool or admixed forms (6,10,17), and support secondary admixture processes in the western Mediterranean basin, with contribution from western populations of var. sylvestris, as clearly shown by the plastid lineages. Indeed, the largest wild populations of olives are found in the western Mediterranean.
In sum, the plastid tree suggests genetic contributions, at least in the maternal lineage, of three different genetic pools that may be in agreement with multilocal domestication processes, whereas the nuclear tree suggests a dominant, congruent signal shared by all sequenced cultivars that is consistent with a unique primary domestication center in the eastern MB.

Genetic introgression from western var. sylvestris genetic pools among cultivated olives
Previous phylogenetic analyses have suggested the existence of ancestral inter-subspecies hybridization in deep nodes of the evolutionary tree of O. europaea (25). Our population genomics results discussed above and a recently published transcriptome-based population analysis (11) suggest recent intraspecific genetic admixture between western cultivars and western sylvestris. To incorporate genetic admixture into a phylogenetic framework, we reconstructed a split network tree based on nuclear genome data (Fig. 3), which revealed a heavily reticulated structure mostly affecting the relationships among all olive samples. In particular, most samples with the exception of cultivars Sorani, Beladi and sylvestris-T appear in a heavily reticulated area that shows a distant connection with sylvestris-S.
Consistent with the population genomics results, sylvestris-T appears well-embedded within cultivars and with little signs of reticulation. Overall, such a reticulated network can be the result of hybridization (including introgression) or incomplete lineage sorting. In order to detect evidence of introgression among the olive samples, we ran the ABBA-BABA test (53) using the program Dsuite v0.1 r3 (54) (see material and methods) for all quartets of the phylogenetic tree (Fig. 2a). This analysis shows support for introgression between sylvestris-S and the lineages of 'Chemlal de Kabilye', 'Megaritiki', 'Koroneiki', with a significant Dstatistic around 0.43 (Fig. 2c). In addition, there seems to be some support (D-statistic ~0.

Identification of genes under selection
Olive trees were domesticated for their fruits, either as a source of oil or edible fruits (55).
Genes associated with agronomic traits may have undergone positive selection during domestication. In order to search for genes putatively under positive selection in the cultivars, we first classified the SNPs into intergenic, intronic, and coding. We further classified coding SNPs according to whether they imply synonymous or nonsynonymous changes (see material and methods). Additional file 1: Fig. S6 shows that there is a higher percentage of SNPs in intergenic regions (4.7 SNPs/Kb), followed by intronic (0.8 SNPs/Kb) and coding regions When we analyzed the SNPs that can produce a nonsynonymous change, including heterozygous and homozygous SNPs, we found that a total of 32,723 proteins (59% of the predicted Oe9 proteome) have at least one SNP with nonsynonymous change, of which 14,985 are common for all the individuals (Additional file 2: Table S5). On the other hand, the list of proteins that did not show any nonsynonymous change were used to perform a functional enrichment analysis, and we found overrepresented GO terms associated with molecular functions, cellular components, and one biological process (see Additional file 2: Table S6). These functional categories may be under a particularly strong purifying selection.
When we look at the genes without SNPs, 21,492 genes are present in all individuals and few are unique to each individual (Additional file 2: Table S5). Interestingly, sylvestris-S has the largest number of genes (1,581) without SNPs relative to cultivar Farga.
To further determine whether one or more proteins are under positive selection in cultivated olives, we searched for evidence of recent selective sweeps by quantifying site frequency spectrum (SFS) deviations relative to genome-wide patterns using the composite likelihood ratio (CLR) statistic implemented in SweeD (58) (see material and methods). A total of 555 regions, which fell within the top 0.5% of the observed distribution were considered as candidates for positive selection. After merging the overlapping regions (see material and methods), we kept only 181 genomic regions, from which 5% were larger than 50 kb and 56% shorter than 10 Kb. A total of 84 genes were identified in 69 sweep regions. Among the genes found within sweep regions, some are associated with lipid, carbohydrate, and amino acid metabolism, and others with stress tolerance, solute transport, and RNA processing (Additional file 2: Table S7). Remarkably, we did not find any gene related to fatty acid metabolism and accumulation, although this has presumably been one of the most important characters under selection in olive domestication. Moreover, a recent study found 19 genes associated with 5 important agronomic traits in olive (59) and despite 14 of them being found in Farga (see Additional file 2: Table S8), none was present in the detected selective sweep regions. Similarly, a study based on transcriptomes of 68 different wild and cultivated samples using statistical approaches (PCAdapt (60) and BayeScan (61)) failed to identify candidate genes under selection associated with oil content or fruit size (11). They did, however, detect ten genes as strong candidates for selection that were associated with transcriptional and translational activities and to the cell cycle. Also, it was proposed that domestication in olive may be more related to changes in gene expression than changes in protein function in agreement with its evolutionary history (11). Further studies with more cultivated and wild samples will be needed to test this inference further.
Since some cultivars show signature of introgression with sylvestris-S, we assessed whether adaptive introgression has contributed to olive domestication. When we analysed the 181 selective sweep regions, we found that on average 31% of these regions were overlapping introgressed regions (Additional file 1: Fig. S7b, Additional file 2: Table S9), with 'Megaritiki' being the cultivar with more common regions (51%), followed by 'Chemlal de Kabilye' (45%). Overlapping regions of introgression with selective sweeps suggest that adaptive introgression seems to have played an important role in olive domestication, as it has been observed in other crops (62).

Improved assemblies for nuclear and organellar genomes
Our first reference genome assembly for Olea europaea (Oe6 version) (22) provided a needed resource for researchers interested in understanding the genetic basis of olive traits and the process of domestication. Here we improved the assembly (Oe9) by anchoring it to chromosomes using a publicly available genetic map (24). In addition, we produced individual annotated assemblies for the plastid and mitochondrial genomes, which were not provided as separate assemblies in earlier releases. Our comparison of our improved nuclear assembly with the recently released assembly of an oleaster from Turkey (23) and the identification of shared duplicated regions provides additional support to the previously proposed ancient polyploidization in olives (25). Unexpectedly, we found apparent differences in gene content even using a conservative approach that is independent of differences in annotation methodologies. These differences could reflect shortcomings of the assemblies, changes occurred during domestication, or variation in gene content among individuals, as observed in grapevine, another perennial crop (63). Further efforts in improving reference assemblies for the olive tree with new long read technologies will definitely enhance our understanding on the effect of genome rearrangements, including deletions of genomic regions during the evolution of olive.

Reliable genetic sources: feral individuals vs. true wild trees
Understanding the domestication process in olives requires careful comparison between reliably identified cultivars and wild individuals. Indeed, plant choice is crucial before starting any genomic studies. In order to use reliable genetic sources, biology features of olive germplasm have to be considered in the field: a) high phenotypic plasticity and resilience; b) long reach of aerial transported pollen; c) the ability to cross between genetically distinct individuals; and d) the long lifespan of the trees (often reaching several centuries and millennia). These characteristics make it extremely difficult to tell apart pure wild individuals from those highly introgressed or even cultivars (64). As stressed earlier, we took great care in selecting an individual from an isolated wild olive population, which had been studied over the last years and meets key characteristics that conform to a true oleaster.
Our phylogenetic analysis based on the nuclear genome places this wild individual at the earliest branching position within the individuals of the subspecies europaea, sister to the rest, which is fully consistent with its ascription to a wild western Mediterranean genetic pool. In stark contrast, the results for the wild sample from Turkey (23) were puzzling. The phylogenetic analysis placed it at a relatively shallow clade, forming a tight cluster with two cultivars from the eastern MB. This result could be explained through four hypotheses: i) the Turkish individual identified as sylvestris is actually a feral olive, ii) this tree is an oleaster highly introgressed with cultivars; iii) considering that the cultivated olive originated from wild oleasters, this pattern may represent one of multiple primary domestication events; or, alternatively (iv) a separate, very recent domestication event. However, hypotheses (iii) and (iv) are at odds with the observed topology in which the well-defined clade formed by sylvestris-T, 'Beladi', and 'Sorani' is placed at a rather shallow position of the tree, embedded within a larger clade of other cultivars; whereas sister-clade relationships to other cultivars would be expected from independent domestication events. Further population analysis, including the search for introgression signals discard hypothesis (ii), as sylvestris-T individual showed weak signature of introgression and instead a genetic structure highly similar to the nearby sampled cultivars. In addition, the rarity of wild olive trees in the eastern Mediterranean (65) together with the small number of phenotypic characters to differentiate cultivated olives from oleaster, must be considered. The "Flora of Turkey" (page 156) states that "… spontaneous seedlings may revert to var. sylvestris" which, despite the misuse of the term reversion, highlights the phenotypic similarity between ferals and true oleasters in the region (65). Earlier studies at a large geographical scale already mentioned the difficulty to find genuine oleasters in the eastern MB (45,66). Recent experiences confirmed this fact, analyzing putative wild olives from Turkey (15) and Israel (12) that appeared to be feral after genetic analyses. Given all these facts, we suggest that sylvestris-T is a feral individual, and we have treated it as such to avoid confounding conclusions. Given the open debate on possible alternative centres of domestication, and the interest in tracing the genetic aftermath of domestication in olives, the meticulous choice of additional wild populations will be necessary. In order to reliably reconstruct complex domestication processes, we stress the importance of careful sample collection of olive trees in the field using ecological (67) and key morphological characters (68), assisted with molecular screening of olive material, before undertaking genome sequencing of wild trees.

Genomics support for primary and secondary events of domestication driven by introgression from wild genetic pools
This study shows the first phylogenetic analysis of genome-wide sequences of the Mediterranean olives. Our results, together with those from previous analyses (11), suggests that cultivated individuals have slightly lower nucleotide diversity as compared with wild individuals. However, our demographic analyses support the existence of a relatively small population size at the time of domestication, with a steady decrease in population size preceding domestication as has been inferred for some other crops (69,70). These analyses are consistent with earlier studies suggesting a narrow distribution (and hence limited population size) of oleaster populations in the eastern MB over the last 150,000 years (8).
The demographic analysis also indicates a mild population bottleneck around 5000-7000 years ago, consistent with the proposed period for primary domestication of olives in the Levant around 6000 years ago (8). Interestingly, our analysis also suggests a rapid increase in population size following the domestication bottleneck, likely coupled to the expansion of olive cultures in broader Eastern Mediterranean countries. Altogether, these results suggest that one ancestral genetic pool, likely deriving from a founding population from eastern wild trees, is pervasive among cultivars. This is consistent with common ancestry at a primary domestication event from which all cultivars descend. Altogether, the phylogenetic and admixture analysis show that rampant hybridization has shaped the evolutionary history of the different lineages of olives. We thus hypothesize a recurrent pattern of domestication that involves a primary domestication event in the east of the MB, followed by nested secondary (and tertiary, and so on) local domestication events with the involvement of additional wild genetic pools, forming a domestication continuum (6). Admittedly, increased genomic sampling, particularly of true wild populations, is needed to help describe these complex patterns of evolution in Olea europaea across the numerous areas of the Mediterranean basin.

Lack of a clear domestication syndrome
Cultivated olives have undergone a complex domestication process, which has led to morphological and physiological changes. The main traits selected during the transition between oleasters and cultivars are fruit size and oil content (13,55,68). The domestication scenario described in the previous section, which is punctuated by hybridization, may make it difficult to detect genes selected during the process of domestication. We used branch-site models to detect sequence changes likely associated with selection. We detected very few genes putatively selected at the lineage subtending all sequenced cultivars. However, when analyzing each cultivar lineage individually some patterns emerged. In particular, genes positively selected in cultivated olives were associated with a response to biotic and abiotic stress. However, further analyses are needed to ascertain whether they indeed play a role in trait selection during domestication (see (72) for secondary compounds). Previous analyses detected few signs of positive selection affecting protein-coding regions, but they did detect differences in expression levels between cultivars and wild individuals for specific genes (11). This led to the conclusion that selection may have acted on non-coding regions that drive gene expression. However, given the difficulty of controlling for other factors driving gene expression (different periods of the year, local environmental conditions), we believe this result must be viewed with caution. Alternative approaches are needed to detect alleles whose presence in the cultivars were selected through domestication. We cannot rely on models that assume only vertical evolution, but would rather search for shared conserved or introgressed regions across different cultivars sharing similar phenotypes (68). To be effective, a much larger sampling of genomic sequence will be needed for such an approach.
In addition, assemblies from representative lineages of reliable wild olives will help to better trace the origin of different introgressed regions in the genomes of cultivars. Overall, our new phylogenomic and genetic analyses of whole-genome sequences show evidence for a complex process of reticulation disrupting historical isolation in the course of olive domestication.

Scaffolding of the cultivated olive genome using a linkage map
A new, improved version of the O. europaea genome assembly (Oe9) was produced by anchoring the Oe6 version (22) to a publicly available genetic map (24) using ALLMAPS (73). First, we took the intersection of 7,042 markers for which a sequence was provided and mapped them with BWA v0.7.15 (74) to the Oe6 olive genome assembly. Filtering for minimum mapping quality 20 and fewer than 10 mismatches, we obtained 5,780 mappings.
Intersecting these mapped markers with the 3,404 markers placed in the genetic map resulted in a set of 2,759 markers with both a genetic map position and an unambiguous physical location in the Oe6 assembly. ALLMAPS was then run with default parameters. A total of 2,362 markers were considered unique, of which 2,134 were anchored, and 228 unplaced.

Functional annotation of the cultivated olive genome
In order to give more functional insight to our analysis, we decided to improve the functional annotation of the genes present in the Oe6 version of the genome by running Blast2go (75), which in turn ran a BLASTP (76) search against the non-redundant database (April 2018) and Interproscan (77) to detect protein domains on the annotated proteins.

Comparison of the europaea and sylvestris genomes
To compare the two available genome assemblies of O. europaea subsp. europaea (var. europaea and sylvestris), we plotted their cumulative assembly lengths using the R Package 'ggplot2' (78). The predicted protein-coding gene sequences of the two assemblies were compared using a search with BLASTN (26). Results were filtered using cutoffs of 80% identity and e-value<1e-5. For the genes that did not have a hit, we analyzed whether they were covered by reads of the other sequenced accession. For this step we first mapped the reads of each genome against the other using BWA 0.7.6a-r433 (74). We considered a gene as individual-specific (e.g. europaea-specific) if it did not pass a filter of coverage >20 over 50% of the gene length in the other genome (e.g. sylvestris). With the aim of achieving a more conservative set of individual-specific genes, we applied a more stringent filter of coverage (>5). Finally, in order to search for the functions of the genes without a hit in the other genome (individual-specific), we performed a BLAST search against the NCBI nonredundant database, and the same cutoffs as described before.

Genome sequences
We sampled and sequenced twelve genomes of O. europaea: ten cultivars ('Arbequina', cuspidata to be used as outgroup. These samples broadly covered the geographical distribution of the species in the MB (see Table 1). The authenticity of these cultivars was previously substantiated through molecular and morphological markers (9). The sequenced oleaster (referred to as sylvestris-S from now onwards) was collected in the North of Spain.
All the samples were collected according to the local, national or international guidelines and legislation.
The DNA of all individuals was extracted as described in (22) and their genomes were sequenced using Illumina HiSeq 2000 paired-end technology to a sequencing depth ranging from 24 to 34x at the CNAG-CRG sequencing facilities, as described for the reference genome (22). In addition to these individuals, we used publicly available data of the reference genome sequence of the olive cultivar Farga, (22). 'Farga' is a cultivar from Catalonia (eastern Spain) with the E3.1 plastid haplotype and previously classified as representative of the central MB cultivated genepool (Table 1). We also used a recently published assembly of O. europaea var. sylvestris (referred as sylvestris-T from now onwards) (23), and downloaded fourteen plastid genomes of the Mediterranean olive from the NCBI database (see Table 1).

Organelle Assemblies
The available reference genome sequence does not include separate scaffolds for mitochondrial and plastid genomes (22). Here, we assembled and annotated both organellar genomes of the cultivar Farga using paired-end (PE) and mate-pair (MP) data from the reference genome sequence project (22). Briefly, all genome shotgun Illumina reads were mapped using BWA v0.7.13-r1126 (74) to a reference plastid sequence (NC_013707), and a mitochondrial sequence (MG372119). Then reads were filtered allowing only those that mapped in proper pairs with a hard and soft clipping for a maximum of the 25% of the total length of the read. The plastid genome was assembled using NOVOPlasty v2.6.3 (79). The mitochondrial genome was first assembled with SPAdes v3.10.0 (80). Then, this initial assembly was scaffolded using SSPACE_Basic v2.0 (81) using PE and MP libraries. Finally, gaps were filled with GapFiller v1-10 (82).

Detection of Single Nucleotide Variants
To assess nucleotide diversity across sequences of the Mediterranean olive at the nuclear, plastid and mitochondrial levels, we called single nucleotide polymorphisms (SNPs) using the cultivar Farga genome as a reference for all the cases (for the nuclear genome we used the Oe9 version). Sequenced reads from each individual were mapped against the reference genome using BWA 0.7.6a-r433 (74), and SNPs were identified with GATK HaplotypeCaller v4.0.8.1 (85), setting ploidy to 2, and using thresholds for mapping quality (MQ > 40), quality by depth (QD > 2), Fisher strand bias (FS < 60), mapping quality rank sum test (MQRankSum > -12.5), read pos rank sum test (ReadPosRankSum > -8), strand odds ratio (SOR < 3), read depth of coverage (DP > = 10), and allelic depth (AD >= 3). Sites with missing alleles and spanning a deletion were also removed. Finally, VCFtools v0.1.15 (86) was used to filter out positions according to the number of alleles (--max-alleles 2), and minor allele frequency (--maf 0.04).

Nucleotide diversity
Pairwise nucleotide diversity (πN/πS) i ) was calculated using VCFtools v0.1.15 (86)  Then we identified population structure without a priori grouping assumptions, using the Structure software v2. 3.4 (89). Structure was run with 100,000 generations of 'burn in' and 100,000 Markov chain Monte Carlo (MCMC) iterations after burn-in for increasing K values ranging from 1 to 6, considering independent alleles and admixture of individuals.
Simulations were repeated five times for each value of K. The optimal number of genetic clusters was determined using the ΔK metho K method (90) using the software Structure Harvester (91). Finally, the optimal K value was visualized with DISTRUCT v1.1 (92).

Phylogenetic analysis
Phylogenetic trees were reconstructed using SNP data from nuclear, plastid and mitochondrial genomes separately. In each case, the genome sequence of each sequenced individual was obtained by replacing the SNP positions in the respective reference genome, resulting in a pseudo-alignment of all the sequenced genomes. Specifically, for the nuclear dataset we included only homozygous SNPs. For the plastid genomes, we included additional sequences by aligning our genomes with the database genomes (see Table 1) using MAFFT v7.305b (93). All these alignments were trimmed using trimAl v1.4 (94)  Additionally, for the nuclear data, we reconstructed a phylogenetic network using SplitsTree4 v4.14.5 and the NeighborNet approach (96).

Analysis of Introgression with SNP Data
The ABBA-BABA test (53)  GO term enrichment analyses of the proteins without nonsynonymous SNPs was calculated using FatiGO (98). To investigate the variation of nonsynonymous and synonymous SNPs within coding regions, we compared nonsynonymous changes per nonsynonymous site (πN/πS) i N) to synonymous changes per synonymous site (πN/πS) i S) using a synonymous: nonsynonymous site ratio of 1:3 (99).

Identification of genes under selection
To detect protein-coding genes that have potentially undergone selection among the cultivated individuals we used an approach based on the site frequency spectrum (SFS). SweeD v3.1 (58) was used to identify selective sweeps in olive. This program is based on Sweepfinder (100) and uses a composite likelihood ratio (CLR) to identify loci showing a strong deviation in the site frequency spectrum (SFS) toward rare variants. We used as outgroup the subspecies cuspidata to infer ancestral alleles. SweeD was run separately for each scaffold and grid as the only parameter. The grid parameter was calculated per scaffold in order to have a measure of the CLR every 5 Kb (size of the scaffold/5000). Outliers were defined as the 0.5% with the most extreme p-values. Closer regions, with less than 10 bp distance were collapsed. Finally, a GO term enrichment analysis of the proteins that mapped to those regions was calculated using FatiGO (98).

Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.            Additional file 2: Table S1. General characteristics of the two genome assemblies of the cultivar Farga, and the genome assembly of the var. sylvestris from Turkey. Table S2. List of unique genes of europaea and sylvestris with their homologous function.  nonsynonymous SNP. First column shows the term category, the second, the GO term, the third, the term level, the fourth, the p-value, and the fifth, the term name. Table S7. List of proteins in regions with selected sweeps and their associated function. Table S8. Blast results of the 19 genes of sylvestris-T against Farga. The results were filtered by %identity >90 and e-value<1e-5.