In this study, we developed modern genomic tools (unigene set, SNP-array and gene-based linkage maps) and applied them to the identification of a deleterious allele segregating at an embryo viability locus, and to studies of the extent and distribution of recombination along the chromosomes and the factors (sex, genetic background) potentially accounting for differences.
Development of genomic tools to facilitate genetic research in maritime pine
In a recent review, McKay et al.
 summarized the transcriptomic resources currently available for the five best-studied coniferous genera. For maritime pine, the first unigene set was derived from 30 k Sanger ESTs and contained 4,483 contigs and 9,247 singletons
. A second version (available from
) was established with about 0.88 million curated reads, mostly obtained from high-throughput sequencing (Roche 454 platform) and assembled into 55,322 unigenes
. The third version, presented here, corresponds to the largest sequence data collection obtained to date, with over two million 454 reads assembled into 73,883 contigs and 124,542 singletons. It therefore constitutes a major step toward the establishment of a gene catalog for this species. The Roche 454 pyrosequencing platform was chosen because it provides long reads (325 bp in cleaned reads, on average, in this study) that are particularly useful for de novo transcriptome assembly, especially when no reference gene model is available. We will not discuss the content of version 3 further here, because the three datasets were merged together (as they used essentially different sequence reads: Sanger, 454, Illumina) to obtain a large annotated catalog of full-length cDNAs. In the absence of a sequenced genome for a conifer, such a catalog will serve as a reference for guiding the assembly of further short-read sequences. This approach is considered the most cost-effective method for both: i) gene expression profiling
 to determine the molecular mechanisms involved in tree growth and adaptation (for example,
); and ii) polymorphism detection
[30, 31] for applications in evolutionary ecology (for example,
), conservation and breeding (for example,
). In parallel with the production of Pinus pinaster ESTs, the transcriptomes of more than a dozen conifer species were sequenced and assembled
. These species included three pine species, but not Pinus pinaster. The 1,000 Plant Transcriptome project
 will also provide transcriptome data for at least 48 conifer species. Overall, this vast body of data will provide a remarkable resource for comparative genomics in conifers, with maritime pine continuing to play a key role in the development of transcriptomic resources for population and quantitative genomics studies.
Next-generation sequencing of the transcriptome is a powerful strategy for identifying large numbers of SNPs in functionally important regions of the genome
. For non-model species, including conifers, this approach is particularly effective when coupled with existing unigene sets, because the reference contigs facilitate the effective assembly of newly generated short reads (as illustrated by Rigault et al.
 and Pavy et al.
 for spruce). In this study, we identified a large number of gene-associated SNPs by in silico mining of the maritime pine unigene assembly. It should be noted that the SNPs were selected exclusively from sequence reads associated with cDNA libraries constructed with Aquitaine genotypes. In addition, given the high sequence error rate associated with 454 sequencing (approximately 0.5%
), we used stringent criteria (minor allele frequency (MAF) ≥33%, coverage ≥10x) to avoid the selection of SNPs present at such low frequencies that they are likely to be the product of sequencing error. Consequently, SNPs with low MAFs are less likely to be represented in our genotyping array, and this selection procedure would introduce an ascertainment bias if applied to natural populations from other maritime pine provenances. As our goal was to design a SNP array for use with the Illumina Infinium assay, we also limited our selection to SNPs that were likely to perform well (assay design tool (ADT) score ≥0.75) with this technology, introducing a second bias toward less polymorphic genes, because this score is lower when the flanking sequences contain SNPs. Furthermore, using RNA as the starting material undoubtedly resulted in genes not being equally represented, with highly transcribed genes probably overrepresented in our sample.
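The three selection thresholds described above (MAF ≥33%, coverage ≥10x, ADT score ≥0.75) can be illustrated with a minimal sketch. It is not the pipeline used in this study: the `CandidateSNP` record, the function names and the definition of MAF as the read fraction of the second most frequent allele are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class CandidateSNP:
    """Hypothetical record for one in silico SNP (not the study's data model)."""
    snp_id: str
    allele_counts: dict  # allele -> number of reads supporting it
    adt_score: float     # assay design tool score for the Infinium assay


def minor_allele_frequency(allele_counts):
    """Read fraction of the second most frequent allele (assumed MAF definition)."""
    total = sum(allele_counts.values())
    if total == 0 or len(allele_counts) < 2:
        return 0.0
    ordered = sorted(allele_counts.values(), reverse=True)
    return ordered[1] / total


def passes_filters(snp, min_maf=0.33, min_coverage=10, min_adt=0.75):
    """Apply the three thresholds quoted in the text to one candidate SNP."""
    coverage = sum(snp.allele_counts.values())
    return (
        coverage >= min_coverage
        and minor_allele_frequency(snp.allele_counts) >= min_maf
        and snp.adt_score >= min_adt
    )


candidates = [
    CandidateSNP("snp_kept", {"A": 6, "G": 4}, 0.90),
    CandidateSNP("snp_low_maf", {"A": 9, "G": 1}, 0.90),
    CandidateSNP("snp_low_cov", {"A": 5, "G": 3}, 0.90),
    CandidateSNP("snp_low_adt", {"A": 6, "G": 4}, 0.50),
]
selected = [snp.snp_id for snp in candidates if passes_filters(snp)]
```

With these toy read counts, only the first candidate satisfies all three criteria, which illustrates how the MAF and ADT thresholds bias the array toward common variants located in less polymorphic flanking sequences.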
For the 6,299 nucleotide replacement SNPs, 25% failed and 40% to 57% were monomorphic, depending on the population, whereas 19% of the assays failed and 80% of the markers were monomorphic for insertion-deletion mutations. Thus, indel mutations are more prone to sequencing errors with the Roche sequencing platform and should clearly be avoided in the Infinium assay. Taking into account only the markers polymorphic in both of the pedigrees studied, 1,970 different gene loci were successfully tagged with at least one SNP and mapped (either as framework or accessory markers) within the genome.
High-density linkage maps are crucial to our understanding of quantitative trait variation, especially for species without a reference genome assembly. With the recent development and thorough assessment of SNP markers, saturated, high-density genetic linkage maps have been established for several conifers, including Cryptomeria japonica (1,216 markers, 968 corresponding to SNPs, over 1,405 cM,
), Picea mariana and Picea glauca (consensus map of these two species comprising 1,801 gene loci over 2,083 cM,
), Pinus taeda (1,816 genes over 1,898 cM,
) and Pinus pinaster (this study). As in these aforementioned studies, the expected map coverage rate for the maritime pine linkage maps was high (about 100%), indicating that the maps developed in this work are saturated. Thus, the mean distance between adjacent markers (2.6, 2.3 and 1.5 cM in the G2F, G2M and F2 maps, respectively) was strongly skewed toward small distances [see Additional file
11]. These next-generation linkage maps will facilitate the analysis of conifer genome evolution, by making comparative mapping possible at a scale that was not achievable with previous, low-throughput marker systems (for example,
Comparison of segregation patterns between inbred and outbred matings indicates the presence of a chromosomal region with a deleterious mutation acting at the postzygotic stage
Departure from Mendelian expectations, which is also known as segregation distortion (SD,
), is frequently reported in linkage mapping studies (reviewed by Li et al.
). If a gene causing SD is segregating in a population, then the markers close to it tend to display distorted segregation ratios. Thus, as a rule of thumb, the clustering of markers displaying SD in particular genomic regions (so-called segregation distortion regions, SDRs) may indicate that segregation distortion is caused by genetic factors rather than statistical bias or genotyping errors. However, as illustrated in this study, small population size may lead to false positives and the identification of spurious SDRs. Care should therefore be taken to validate SDRs before any biological interpretation is attempted.
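As an illustration of how a single marker can be tested for departure from Mendelian expectations, the sketch below applies a chi-square goodness-of-fit test to the 1:2:1 genotype ratio expected for a codominant marker in an F2 progeny. It is a simplified example rather than the procedure applied in this study: the function names and the significance threshold `alpha` are assumptions, and no correction for multiple testing is included.

```python
import math


def chi_square_f2(n_aa, n_ab, n_bb):
    """Chi-square test of a 1:2:1 genotype ratio at one codominant F2 marker.

    Returns the statistic and its p-value. With 2 degrees of freedom the
    chi-square survival function has the closed form exp(-x / 2).
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotyped individuals")
    observed = (n_aa, n_ab, n_bb)
    expected = (n / 4, n / 2, n / 4)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2, math.exp(-chi2 / 2)


def is_distorted(n_aa, n_ab, n_bb, alpha=0.01):
    """Flag a marker as distorted at the (assumed) threshold alpha."""
    _, p_value = chi_square_f2(n_aa, n_ab, n_bb)
    return p_value < alpha


# Toy counts: a deficit of one homozygous class, as expected near a
# recessive deleterious allele.
chi2, p_value = chi_square_f2(40, 50, 10)
```

Because the statistic depends on sample size, the same proportional deficit gives a much weaker signal in a small progeny, and random fluctuations are more likely to cross the threshold by chance. This is one reason why clusters of distorted markers should be validated before they are interpreted as SDRs.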
Biologically, aberrant Mendelian segregation can be attributed to selection occurring at different stages of the plant’s life cycle, from gametophyte development to seed germination and plant growth
[20, 40]. In this study, a single cluster of distorted markers was detected and validated in LG2 of the F2 map, whereas the corresponding genomic region on the two G2 maps displayed no deviation from the expected Mendelian segregation ratio. This strongly suggests the presence of a deleterious mutation (or a cluster of tightly linked embryo viability loci), revealed by inbreeding, that influences the fitness of the F2 zygotes at some point between fertilization and the age of 10 years (as the tissues sampled for DNA extraction were taken from 10-year-old trees). This conclusion is supported by two additional observations. First, this F2 family was selected specifically because of its low rate of seed abortion (frequency of embryo-less seeds, as estimated by assessing floating in water, was lower than for other available F2s from the maritime pine breeding program, unpublished results), making it particularly suitable for genetic analyses requiring a large sample size. In our study, 638 seeds were initially planted in a nursery in June 1998; 626 seedlings germinated (that is, only 1.9% died soon after germination) and were transplanted into the field in March 1999. Total height was then measured every fall, beginning in 1999. Fifteen seedlings died during the first growing season in the field (assessment in the fall of 1999). The following year, 43 other seedlings died, but no further deaths were recorded thereafter. It is difficult to determine whether these deaths were due to some crisis during transplantation from the nursery to the field or to genetic load. However, peak mortality did not occur in the nursery or just after field transfer, and the semilethal allele was inherited from the Corsican paternal grandparent. These findings suggest that this SDR decreases the fitness of homozygous Corsican genotypes in early stages of development and later in tree growth.
Unfortunately, no post-mortem analysis involving the sampling of plant material from the whole progeny just after germination was performed, to determine whether the dead plants were all homozygous for the Corsican allele in the SDR concerned.
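The mortality figures reported above can be tallied with a short worked example. The counts are those given in the text; the function name and the structure of the returned summary are hypothetical.

```python
def survival_summary(seeds_planted, seedlings_transplanted, field_deaths):
    """Summarize losses in the nursery and in the field.

    field_deaths lists the number of deaths recorded per growing season.
    """
    nursery_losses = seeds_planted - seedlings_transplanted
    total_field_deaths = sum(field_deaths)
    survivors = seedlings_transplanted - total_field_deaths
    return {
        "nursery_loss_pct": 100 * nursery_losses / seeds_planted,
        "field_mortality_pct": 100 * total_field_deaths / seedlings_transplanted,
        "survivors": survivors,
        "overall_survival_pct": 100 * survivors / seeds_planted,
    }


# 638 seeds planted, 626 seedlings transplanted, then 15 and 43 deaths
# during the first two growing seasons in the field.
summary = survival_summary(638, 626, [15, 43])
```

With these counts, nursery losses amount to about 1.9% of the seeds planted, whereas the 58 deaths in the field represent about 9.3% of the transplanted seedlings and leave 568 surviving trees. The second field season alone accounts for 43 of these deaths, in line with the observation that peak mortality did not occur in the nursery or immediately after transplantation.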
Second, in a previous study, Plomion et al.
 compared the segregation patterns of random amplified polymorphic DNA (RAPD) markers in megagametophytes (a maternally derived haploid tissue surrounding the embryo) from the same hybrid tree (H12), sampled from either inbred (self-cross) or outbred (open-pollinated cross) seeds. They observed no significant SD for loci in the dataset resulting from selfing, suggesting that gametic selection, leading to gamete abortion or lower gamete fitness, can be ruled out as a possible cause of SD in this study.
Genomic regions containing lethal or sublethal alleles have already been detected in several conifers, through linkage mapping approaches (reviewed in Williams
). The number of such lethal or sublethal equivalents is generally high in populations, as revealed by the typical high level of inbreeding depression in these outcrossing species (reviewed by Williams and Savolainen
), but their severity varies in the population, with some genotypes (like that selected in this study) bearing mutations that are, a priori, less deleterious than others. The nature of the underlying loci remains unclear. Some of these genetic factors are involved in early embryo development, resulting in a lower yield of filled seeds upon selfing, others decrease seedling growth and cause abnormal phenotypes, whereas others are directly involved in seedling mortality at later stages of development, from a few weeks
 to a few months after germination, as shown here. In addition to providing fundamental knowledge, the analysis of segregation distortion and the identification of SDR are of great importance for the correct determination of quantitative trait loci (QTL) positions and for the estimation of QTL effects. Indeed, SD influences the estimation of recombination frequency and may, therefore, decrease the accuracy of QTL mapping in this mapping population.
The extent and spatial distribution of meiotic recombination is genetically variable
Recombination is a driving force behind the generation of genetic diversity and is also a key process shaping genomic architecture
. An understanding of the factors controlling the frequency and genomic distribution of meiotic recombination is, therefore, essential if we are to manipulate this process to improve breeding accuracy. This study generated three major results.
First, we confirmed that, despite their large physical size, pine chromosomes display a number of crossover events similar to that of smaller plant chromosomes. This observation led Thuriaux
 to suggest that recombination was confined largely to the coding regions, because all eukaryotes have approximately the same number of genes, as demonstrated by the genome sequences of various organisms (for example, in Arabidopsis, rice
[46, 47], maize
 and sorghum
), although other genomic features may affect recombination. At the microscale, no consistent relationship has yet been established between recombination rate and gene content
[50, 51], suggesting that it is probably not correct to assume that all plant recombination hotspots correspond to gene-rich regions. It will not be possible to determine whether recombination hotspots correspond to gene-rich regions in conifers until a complete conifer genome sequence is obtained. However, as gene-rich regions tend to be associated with high rates of recombination in other plants, it seems likely that relationships between crossover frequency and gene density will not deviate from this trend in conifers. For example, in bread wheat (17 Gb/C), a non-uniform crossover gradient along chromosome 3B has been observed, with lower frequencies of crossover in the gene-poor centromeric region and the highest frequencies of crossover in the distal subtelomeric regions, in which gene density is higher
. At a finer scale, these authors also demonstrated that gene content was one of the factors driving recombination in this species.
Second, we observed that meiotic recombination was not randomly distributed along the length of the maritime pine chromosomes, suggesting that recombination occurs at specific sites, the recombination hotspots (reviewed by Lichten and Goldman
). An uneven distribution of markers is a classical observation in most papers reporting saturated linkage maps for plants and animals. Tests of departure from a Poisson distribution have always been based on a single interval or a series of different, arbitrarily fixed intervals (as illustrated by Moriguchi et al. in Cryptomeria japonica). To our knowledge, only Pavy et al.
 have previously implemented a statistical approach, based on kernel density function, in Picea spp., to overcome the need to use such fixed bandwidths in analyses of ‘gene-rich regions’ as an indicator of suppressed recombination. In this study, we used the same strategy, combining it with a sliding window approach, to improve the resolution of recombination hotspots and coldspots. Interestingly, in most LGs of the G2F and G2M maps, a sharp cold spot located in the middle of the linkage group was surrounded by two large hot spots. This suggests that these cold spots may correspond to the centromeric regions of the chromosome, in which the frequency of recombination is known to be low
[51–55] and in which markers tend to cluster on meiotic maps. However, further studies are required to confirm this assertion. This signature was less clear in the F2 map, which contained about twice as many coldspots as the G2 maps (48 in F2 versus 27 in G2F and 28 in G2M), with a similar number of hotspots (71 versus 62 and 69). An uneven distribution of crossover events has been reported for both species with small genomes and those with large genomes (
 for Arabidopsis,
 for wheat) and an understanding of the distribution of recombination events is critical for various genetic applications. First, following on from the discussion above, if recombination occurs in hotspots and these hotspots bear most of the genes, then differential sequencing efforts will be required to obtain data for all of the genes in conifer genome sequencing programs. Second, as illustrated by Wang et al. for rice
, the map-based cloning of a QTL is facilitated if the QTL is located in a genomic region containing a recombination hotspot, simply because it is easier to identify large numbers of recombinants from segregating populations. This information may be useful for the characterization of genes underlying major QTLs in species with large genomes, such as pines, as already reported for wheat.
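The idea of describing marker distribution with a kernel density function, rather than by counting markers in fixed intervals, can be sketched as follows. This is a generic Gaussian kernel density estimate and not the implementation used here or by Pavy et al.: the function names, the toy marker positions and the rule-of-thumb bandwidth are all assumptions, and the source does not state how the bandwidth was chosen.

```python
import math
import statistics


def rule_of_thumb_bandwidth(positions):
    """Normal-reference rule-of-thumb bandwidth (an assumed choice)."""
    n = len(positions)
    return 1.06 * statistics.stdev(positions) * n ** (-1 / 5)


def marker_density(positions, grid, bandwidth=None):
    """Gaussian kernel density of marker positions (cM) evaluated on a grid."""
    h = rule_of_thumb_bandwidth(positions) if bandwidth is None else bandwidth
    norm = 1 / (len(positions) * h * math.sqrt(2 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - p) / h) ** 2) for p in positions)
        for x in grid
    ]


# Toy linkage group: markers clustered around 50 cM and sparse elsewhere.
positions_cm = [0, 10, 20, 48, 49, 50, 51, 52, 80, 90, 100]
density_cluster, density_gap = marker_density(positions_cm, [50, 35], bandwidth=5)
```

On a meiotic map, a peak in marker density such as the one around 50 cM in this toy example corresponds to many loci separated by few crossovers, that is, a putative coldspot, whereas stretches with low marker density point to regions in which recombination is frequent. A formal analysis would also require a significance threshold for peaks and troughs, which is not shown in this sketch.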
Third, our results show that the extent and spatial distribution of meiotic recombination is genetically variable. The interprovenance hybrid had recombination rates 1.2 times higher (measured on the basis of total map distance) than those of either of the intraprovenance hybrids. This suggests that the genetic divergence of bivalents may account for the extent of recombination at meiosis. However, a comparison of gene heterozygosity between the three genotypes on the basis of both mapping data [see Additional file
6] and the in silico prediction of polymorphisms [see Additional file
12] showed that the diversity of the interprovenance hybrid was intermediate with respect to the diversity of the two intraprovenance hybrids. These two findings indicate that the genetic distance (at least within the gene space, in which most crossover events are thought to occur) between the bivalents does not alter meiotic pairing to a point that would lead to differences in recombination frequencies, as shown in interspecific hybrids by in situ hybridization
 and linkage mapping
. Moreover, the high degree of collinearity between the maps for the intra- and interprovenance hybrids shows that no genome rearrangement occurred during hybridization that might have led to a recombination disorder. We can conclude that the observed difference in map length reflects differences between genotypes. The distribution of recombination events differed between the three genotypes, which had only some hotspots, and even fewer coldspots in common. This suggests that the spatial pattern of recombination along the chromosome is also genetically variable and under polygenic control, as demonstrated by Comeron et al. in Drosophila melanogaster. Recombination is known to be genetically variable
[15, 60, 61] and under the control of multiple trans and cis genetic modifiers. Sequence polymorphisms
[62, 63] and/or the methylation status of these genetic factors may underlie these differences in recombination pattern and should be investigated further in conifers.
Whether the results obtained depend on the type of markers used needs to be addressed. First, it should be noted that the total map length obtained in the present study with coding sequences was similar to that obtained for the same genotypes using anonymous RAPD
 or amplified fragment length polymorphism (AFLP)
 markers (supposedly corresponding to non-coding DNA). Second, maps combining gene-based markers and genomic DNA markers (for example, proteins and RAPDs in
, EST-Ps and AFLPs in
, SNPs and AFLPs in
) were also constructed in this species and did not show any clustering of one or another marker type. Therefore, it is assumed that the recombinational landscape presented in this paper should not be biased by the type of marker (coding versus non-coding) used for linkage analysis.