Genotyping of gene SNPs
A collection of 27,720 white spruce cDNA clusters  was used to develop gene SNPs (Additional file 4). In silico SNP discovery in expressed white spruce (Picea glauca (Moench) Voss) genes was conducted after Sanger resequencing and according to parameters previously reported . A GoldenGate (Illumina, San Diego, CA, USA) SNP array (PgLM2) specifically designed to map additional white spruce genes was constructed with 1,536 attempted SNPs dispersed through 1,509 expressed genes. The assay was conducted following procedures previously described . Genotyping was done at the Genome Québec Innovation Centre (McGill University, Montreal, QC, Canada, team of A. Montpetit) and by using 250 ng of template DNA per sample (at a rate of 50 ng/μL). The results were analysed with the BeadStudio software (Illumina). We retained the 1,292 SNPs exhibiting a GenTrain quality score of 0.25 or more. Of these, 1,121 (73.0%) SNPs representative of 1,098 genes segregated among the 500 progeny of the white spruce cross D (C94-1-2516, ♀77111 × ♂2388). The average call rate per valid SNP was 99.48%. In total, 773,395 new genotypes were obtained from valid SNPs. Using replicated positive controls, the rate of reproducibility was estimated at 99.994%.
Annotation of the mapped genes
The mapped genes were representative of a large array of molecular functions and biological processes including wood formation, growth, vascular tissues development and differentiation, responses to abiotic and biotic stresses and adaptation [83, 84]. They were mostly derived from expressional and functional studies [83, 84] as well as outlier detection  (Additional files 4, 5 and 6).
GO annotation was performed with the Blast2GO software . GO terms were assigned based on the top 10 blastx hits found against the non redundant (nr) protein database with an E-value below 1e-10. GO annotation was run based on the terms from the PlantGO-Slim classification for molecular functions, biological processes and cellular components. Annotations were used to assess GO term enrichment using the Fisher exact test function implemented in Blast2GO.
In total, 6,923 GO terms including molecular functions, biological processes and cellular components were assigned to the 1,801 mapped genes, for an average of 3.9 GO terms per gene. There were 1,285 sequences associated with a molecular function. At level two of the molecular function classification, most of the terms fell in the 'binding' category (49.0%) or in the 'catalytic activity' category (40.0%). At the level three of the classification, five categories each represented about 15% to 20% of the genes; therefore, most of the terms were related either to hydrolase activity, transferase activity, nucleic acid binding, protein binding or lipid binding (Additional file 5). Also, 1,027 sequences were associated with biological processes with a large diversity of 84 terms involved in the annotations (Additional file 5).
Spruce gene linkage maps
Two parental linkage maps for white spruce were estimated de novo from the 500 progeny above for cross D. Parental maps were assembled as previously described , using the genotyping data of the PgLM2 array, as well as data from a previous GoldenGate SNP array on the same 500 progeny . Anonymous markers and gene markers from previous linkage mapping projects involving a different white spruce cross (cross P: C96-1-2516, ♀80112 × ♂80109; 260 progeny) were also considered [56, 82]. The white spruce maps derived from both crosses D and P  were merged with the JoinMap 4.0 function 'Combine Groups for Map Integration'. Before marker ordering within each integrated linkage group, we compared the recombination frequencies estimated for homologous markers from both data sets. For this purpose, a 'heterogeneity test' was conducted with JoinMap4.0. As a result, pairs of loci showing significantly different recombination frequencies (P < 0.01) were eliminated to avoid erroneous marker ordering.
A black spruce (Picea mariana (Mill.) B.S.P.) map  was also used to position 58 additional genes (3% of the total mapped genes) that were not mapped in white spruce. Because the genomes of the two species are highly syntenic and collinear , white spruce and black spruce genetic maps were aligned based on 258 gene markers mapped in common. The white spruce composite map was taken as the reference map; then, black spruce gene loci not positioned in white spruce were transferred onto the white spruce composite map. The position of a transferred gene on the linkage group was estimated as the middle of the interval between the two closest white spruce anchor loci. Hence, the spruce composite map included 1,743 genes from white spruce and 58 genes from black spruce for a total of 1,801 positioned genes.
Gene distribution along the chromosomes
We tested whether the genes were uniformly distributed along the chromosomes or whether they were clustered. For each chromosome, we compared the observed gene distribution with a uniform distribution by using the Kolmogorov-Smirnov test. We computed the maximum difference (noted D
) between the observed distribution F(x) and the empirical distribution F
(x). With large sample sizes such as those considered here (in terms of genes), D
follows a complex distribution and the critical value depends on the sample size. D'Agostino and Stephens proposed a modified statistic named D
* with critical values independent of the sample size . The critical values of statistical significance used for D
* were 0.895 (α = 0.05) and 1.035 (α = 0.01).
Gene density analysis
To identify the location and extent of GRRs, we used the kernel density estimation, a non-parametric technique based on the kernel density function . The choice of the bandwidth is arbitrary and affects the smoothness of the distribution. The function asciker available in the software Stata® 10 (College Station, TX, USA) was used to compute the non-parametric density estimator as well as the confidence intervals at two significance levels (α = 0.05 and 0.01) . We used the adaptive kernel method , which is based on a varying bandwidth instead of an arbitrarily fixed one. This approach enabled us to take into account the density observed locally to compute the bandwidth size. The density and the 95% confidence intervals were calculated with the akdensity function in Stata®10. Then, we compared the position of the confidence interval with the uniform function in the R package. If the lower bound of the confidence interval was greater than the uniform distribution, the region was declared as a GRR (P < 0.05).
Synteny with pine genomes
The gene-based linkage map for the loblolly pine genome  was downloaded from the Dendrome database . Out of 1,816 genes mapped onto the loblolly pine genome, a dataset of 1,666 genes was retrieved from the Dendrome database (accession TG091). These sequences were compared with 27,720 spruce unigenes ) and the 1,801 spruce mapped loci by using the blastn program. Also, 426 mapped loci were collected from maritime pine . For each pine sequence, the best hit among the spruce genes with various thresholds from 70% to 95% of identity was retrieved. All the spruce sequences matching a loblolly pine sequence with at least 95% of identity were located on a homoeologous chromosome. However, to increase the number of matches analysed, a minimum identity level of 80% was retained. Under these circumstances, 7.2% of the matches were not found on homoeologous pairs of chromosome. Moreover, we performed the reciprocal comparison involving mapped spruce genes against the pine sequence database. If one link was found between the pine and spruce chromosomes in both reciprocal analyses, we declared this link as homologous. In a few cases, one gene from one species could match several genes from the other species on the homoeologous chromosome. In such cases, a single link was retained between the two homoeologous chromosomes.
Gene families used in phylogenetic analyses
Phylogenetic analyses were conducted by including angiosperm and gymnosperm (conifer) sequences. These involved 527 mapped spruce genes from 157 families. To assess the representativeness of this sample (Figure 2), we compared the distribution of the molecular function GO terms found across these genes and across the white spruce GCAT gene catalogue  with a two-tailed Fisher exact test. Out of the seven terms assigned at the level two of the molecular function's classification, only two terms were over-represented among the mapped families compared with the gene catalogue (False Discovery Rate (FDR) ≤ 0.01). These terms were 'binding' (GO:0005488) and 'transcription regulator activity' (GO:0030528). Indeed, 31 families encompassing 180 sequences were related to transcription activity or regulation. The four largest groups of mapped transcriptional regulators included 21 genes from the myb-r2r3 family, 17 genes from the b-hlh family, 14 nam genes and 11 aux-iaa genes. At the level three of this classification, only two terms were differentially represented among the mapped families and the spruce gene catalogue (FDR ≤ 0.01). Genes related to nucleic acid binding (GO:0003676) were over-represented while the ones related to transferase activity transferring glycosyl groups (GO:0016757) were under-represented among the mapped genes. Thus, there was no large molecular functional class missing in the sample of 157 families used for the phylogenomic analysis, which was quite representative of the overall relative diversity and abundance of molecular functions seen in the spruce GCAT gene catalogue .
The phylogenetic analyses aimed to classify duplications involving mapped spruce genes as recent or ancient by considering the GA split as a reference (see below). We translated the spruce sequences with the getorf program available in the EMBOSS package . Among all possible open reading frames, we selected the sequence with a match with a protein sequence from another species. The Arabidopsis protein dataset (TAIR7 release) from TAIR ) and the rice protein dataset from the rice annotation database ) were retrieved. We also built a protein sequence dataset from pine. Although not a strict requirement for the present purpose of dating spruce gene duplications, we also retrieved two datasets of EST contigs derived from Pinus pinaster and Pinus taeda . We concatenated the 12,901 sequences from Pinus pinaster and the 72,928 sequences from Pinus taeda into a single file. These sequences were translated with the getorf program to obtain a set of all possible protein sequences. For each spruce sequence, we conducted three blastp searches against the protein datasets from Arabidopsis, rice and pine. In the blastp outputs, we screened the five best hits. We aligned the spruce, pine, Arabidopsis and rice homologous sequences with the kalign program  and we selected manually conserved domains for further phylogenetic analysis. Because the considered angiosperm sequences were from complete genomic sequences, it was not necessary to filter out sequences based on absolute sequence similarity. However, the observed e-values in the blastp searches of the spruce sequences against the Arabidopsis proteome were in the range of e-50 to e-100.
Phylogenetic analyses were conducted for 157 gene families (Additional file 5). To be considered, a family had to contain at least two mapped members on the spruce genome and had to be represented in Pinus as well as in both models and completely sequenced angiosperms Arabidopsis and rice, respectively an eudicot and a monocot. Both the NJ method , an approach based on matrices of substitution rates, and MP analysis  were employed using the package Phylip 3.6 . We used MP and NJ methods instead of other more computer-intensive approaches such as Bayesian or maximum-likelihood algorithms given that the metric that we wanted to estimate, the proportion of ancient versus recent spruce gene duplications relative to the GA split, was based on a large sample of gene families (157), and given that our interpretation of the topologies was conservative because it was based on the strict consensus from MP and NJ trees for each gene family. Results obtained for a few gene families using other more computer-intensive phylogenetic methods resulted in essentially the same consensus trees (results not shown). To estimate NJ trees, distance matrices were calculated using the protdist program with the JTT amino acid substitution matrix and submitted to the program neighbor. Parsimony trees were estimated using protpars. For each family and each method, the robustness of the topologies obtained was assessed by means of 500 bootstraps using the program seqboot. For each method, the consensus tree derived from the bootstrap analysis was the majority-rule consensus generated with the program consense. Then, for each gene family phylogeny, only nodes supported minimally by 50% of bootstraps and in concordance between the two phylogenetic methods were retained, that is, the strict consensus of two majority-rule bootstrap MP and NJ trees.
For each gene family phylogeny, we used the unrooted strict consensus of the MP and NJ trees and estimated the relative age of spruce gene duplications by determining if they occurred before or after the GA split. Nodes involving spruce gene sequences but with no intervening angiosperm sequences from Arabidopsis or rice indicated duplications post-dating the GA split and were referred to as recent duplications. On the other hand, when spruce gene sequences were separated by intervening nodes involving angiosperm sequences from rice and/or Arabidopsis, these duplications predated the GA split and were referred to as ancient. In some cases, ancient gene duplicates produced before the GA split may have been lost both in the Arabidopsis and rice genomes, which would bias upward the number of duplications that would be declared specific to the gymnosperm lineage leading to spruce. Based on the literature, one may argue that this bias should be negligible, given that loss of gene duplicates mostly occurs quickly after duplication if processes such as sub-functionalisation or neo-functionalisation do not occur . At the same time, sampling two divergent and completely sequenced angiosperm genomes (representative of the eudicot and the monocot lineages) should keep such a bias low because gene loss would have to occur independently in both angiosperm lineages after their split, or in their common ancestor during the relatively short period between the GA split (around 300 Mya)  and the divergence of monocots from other angiosperms (around 150 Mya) . For instance, the sister lineage of the large knox-1 gene family in conifers has been lost in rice and other monocots but conserved in Arabidopsis and other eudicots . Finally, because this bias would result in the scoring of a number of false-positive gymnosperm-specific duplications, it would tend to reduce the true proportion between ancient duplications predating the GA split and more recent duplications post-dating this split. If real, the bias is likely negligible, given that the observed value of this proportion was already highly skewed toward ancient duplications (ratio of eight to one, see Results).
All circular genetic maps were drawn with the Circos software .