Overview of WGS strategy and methods
We selected, in two phases, 725 monkeys for WGS from a total of 1,138 VRC monkeys with DNA available in the Biological Sample Repository at UCLA. In the preliminary study phase we selected 105 monkeys for initial WGS, to evaluate the relationships among sequencing depth, variant identification, Mendelian error rate, and pedigree structure. The monkeys selected for the preliminary study included i) four monkeys for high coverage (about 30×) WGS from the surviving generation that was closest to the VRC founder generation, all having multiple (14–115) descendants available for WGS; and ii) 101 monkeys for medium coverage (about 4–6×) WGS from “middle” generations, emphasizing either those with multiple descendants available for WGS or those with extensive biological samples available for phenotypic investigations.
We then selected for sequencing in the next phase of the project an additional 620 monkeys, chosen based either on i) their having multiple descendants or ii) their having been assessed for ≥ 7 phenotypes or having multiple biological samples available. Results from analysis of the preliminary study enabled us to devise an algorithm to rank these 620 monkeys for assignment to high coverage, medium coverage, and low coverage (about 1×) WGS, where monkeys with a higher rank would be sequenced at greater depth than monkeys with a lower rank.
We identified segregating sites in 17 high coverage (30×) monkeys and then called genotypes at these sites in the full set of 725 monkeys (105 + 620). We then conducted a series of filtering steps to ensure that we were retaining high-quality, polymorphic sites, including genotype calling that incorporated linkage disequilibrium (LD) and Mendelian constraints. By thinning these filtered high-quality sites to remove markers with redundant genetic information (based on LD), we generated the final association and linkage SNP sets. The following sections provide details of all steps.
Generation of WGS data
We generated the WGS data for 725 monkeys from genomic DNA provided by the Systems Biology Sample Repository at UCLA. We have generated 100 base pair short read sequences on the Illumina HiSeq2000 instrument from short insert libraries (200–400 bp) for 725 vervets at The Genome Institute (TGI) at Washington University in St. Louis and the McGill University and Genome Quebec Innovation Centre. We have submitted all WGS sequences to the Sequence Read Archive (SRA) under the NCBI BioProject number PRJNA240242.
Preliminary study
Read mapping of preliminary sequence data and refining genotype calls
We followed the published protocol of the 1000 Genomes Project [12] for read mapping of the 105 WGS produced in the preliminary study, with minor modifications related to read trimming detailed in Additional file 3: Supplementary methods Initial variant calling of the alignment files used SAMtools [13]. Due to the size and complexity of the VRC pedigree, we were unable to refine these genotype calls using pedigree-aware methods such as Polymutt [14]. We therefore broke the pedigree into units (trios and parent-offspring duos) that could be analyzed with the programs TrioCaller [15] and Beagle 4.0 [16]. We broke full sibships into distinct trios by replicating parental information. Similarly, for male monkeys who have had offspring with multiple different female monkeys, we replicated this information for each offspring.
With the pedigree broken into trios, duos, and singletons, we used Beagle to first obtain a pedigree-specific haplotype reference panel from the distantly related high coverage monkeys. We then used this reference panel to phase and impute genotypes of all sequenced members of the pedigree. Finally, we used TrioCaller to refine previous haplotypes by incorporating trio constraints and read depth information.
After this step we derived two consensus haplotypes for each monkey, splitting replicated haplotypes from the same monkey into two clusters, corresponding to two chromosomes, based on the Hamming distance between haplotypes. Within each cluster we built a consensus haplotype by accepting the majority call at each locus. We applied the above procedure only to SNPs on autosomes, as the Mendelian transmission model of Beagle and TrioCaller is not applicable to chromosomes X and Y.
Down-sampling analysis to determine coverage strategy
We used the preliminary data to evaluate the impact of WGS coverage in monkeys, in relation to their position within the pedigree, by conducting analyses of down-sampling (reducing the coverage of a monkey, in silico, to a prespecified level). We considered parent-offspring units for this analysis and determined the relative importance of selected parents, compared to their offspring, in influencing the accuracy of genotype assignments. We then extrapolated the results to the entire pedigree.
The down-sampling analysis included three in silico experiments, each repeated three times, on preliminary WGS data. All three down-sampling experiments involved reducing the coverage of one trio with the best available coverage (father sequenced at 30×, mother at 4–6×, offspring at 4–6×) to three different settings. To down-sample the level of coverage of one monkey in silico to the prespecified target coverage, we uniformly sampled a fraction (target-coverage/existing-coverage) of all its existing sequencing reads that had been aligned to a reference genome; this step generated a new alignment file with reduced coverage. We left the alignment files of other monkeys in the pedigree unchanged, and then applied the same calling procedure as that described above to this revised set of 105 alignment files. We then calculated and compared the Mendelian error rates and genotype concordance rates before and after down-sampling.
Sequence coverage ranking
The down-sampling experiment enabled us to evaluate the effect of sequencing coverage on SNP Mendelian error rates in the context of a pedigree, and results suggested that a ranking strategy, based on the number of a monkey’s direct descendants, would be useful to determine sequence coverage of the remaining 620 monkeys. The rank of a monkey would reflect its importance in holding down the error rate of SNPs of all sequenced monkeys, not just the error rate of the monkey with its offspring/parents. We devised a greedy algorithm to rank monkeys in terms of the impact that their sequencing would have on the frequency of errors in SNP calling, over the entire pedigree. We applied this algorithm to all 725 monkeys, assigning the 620 monkeys not yet sequenced to different sequencing coverage classes based on the ranking. This procedure involved the following steps:
-
1)
Construct the pedigree as a directed graph. Each node represents a monkey. Each edge goes from one parent (father/mother) to a child. Remove monkeys that will not be sequenced.
-
2)
Initialize an empty set (child-set). It will store the ID of monkeys for whom one parent has been already ranked. No identical IDs are stored in it.
-
3)
Start with the monkey that has the most direct descendants. Assign that monkey “rank 1” and add its direct descendants into the offspring-set.
-
4)
Select as “rank 2” the remaining monkey that would most increase the size of the offspring-set; if there are ties (contribution to the offspring-set is identical for >1 monkey), older monkeys are given lower rank.
-
5)
Stop when there are no monkeys left to be ranked.
Execution of the sequencing strategy
Sequence alignment of combined data
We conducted read mapping, as described above, of the entire set of 725 WGS monkeys; this step included re-mapping of the 105 monkeys from the preliminary study, described above, given updates to the reference genome assembly since we had produced the data from that study. We performed read mapping for the entire set using the final pre-submission version of the Vervet Reference Genome, consisting of 2,199 scaffolds, and followed the procedures described above for read trimming, alignment, and marking of duplicates.
Additionally, we then performed a base quality score recalibration, and local indel realignment procedures [17]. Because a set of known variants is not available in vervets, we generated an initial “bootstrap” set of SNP and indel variants by running SAMtools on 82 monkeys with coverage >=10× and filtering out all loci outside of a twofold median-depth range or missing fraction more than 50 %. We then input the positions of these loci to GATK’s Base Quality Score Recalibration (BQSR) and local-indel realignment procedures [17].
Stage 1: Identify unequivocal segregating sites from 17 high coverage monkeys
To assemble SNP sets for linkage and association analyses, we wished to include only highly polymorphic and reliable variants. We therefore focused on identifying such polymorphic variants among the most deeply sequenced monkeys. We used a local-assembly-based haplotype caller, Platypus [18], to discover variants from alignment files of 17 high coverage (HC) monkeys. This caller, unlike single-site callers (SAMtools and GATK’s UnifiedGenotyper) used in the 1000 Genomes Project, employs two measures to reduce genotyping errors: 1) It assembles reads within a window (set to 500 Kb in our case) into two haplotypes and then discovers variants by comparing the assembled haplotypes and the reference genome; 2) by local haplotype assembly, it corrects the indel-induced local wrong alignment and excludes wrongly placed reads from the assembled haplotypes. We applied a series of filters (described in Additional file 3: Supplementary methods) to the sites identified in the 17 HC monkeys to reduce the set of variants passed on to Stage 2. Importantly, to retain polymorphic markers that would be the most useful in genetic mapping, we filtered out variants with minor allele frequency (MAF) < 25 % in the set of 17 HC monkeys. As we selected these 17 HC monkeys primarily because they are ancestral to a substantial proportion of the current pedigree, we considered it likely that variants observed in only one or a few monkeys in this set would not be well represented in the pedigree overall. We therefore implemented this relatively high MAF filter to maximize the identification of variants that would be common throughout the pedigree and therefore most useful for genome-wide linkage or association analyses.
Stage 2: Assign genotypes at the SNPs to all monkeys, using SAMtools, with refinement using TrioCaller and Beagle
We used SAMtools to assign genotypes to all 725 sequenced monkeys, at the SNPs identified in Stage 1. As described for the preliminary study, we subsequently refined SAMtools calls with methods that used LD and Mendelian constraints (TrioCaller and Beagle). We then discarded loci whose coverage is outside the twofold range of global median coverage, loci whose minor allele frequency is below 10 % (as estimated in the full set of 725 monkeys), and loci with >=50 % missing calls.
We lifted over the scaffold-based coordinates of VRC SNPs retained in Stage 2, from the pre-submission version of the reference assembly, to the chromosome coordinates of the NCBI-released version (Chlorocebus_sabaeus 1.1, GCA_000409795.2), using GATK’s LifeOverVariants procedure. We applied additional filters to remove SNPs with questionable positions (details are given in Additional file 3: Supplementary methods).
Stage 3: Sample-level QC of WGS data and refinement of pedigree structure
Once we had obtained genotypes for all 725 monkeys at the variant sites identified in the 17 high coverage monkeys, we performed sample-level QC to identify possible sample mix-ups or contamination. Using PLINK [19], we LD-pruned the SNPs that passed Stage 2, employing a very low LD threshold (pairwise r2<=0.1, window size=50, shift=20). We then estimated the pairwise IBD sharing for all pairs of monkeys using this set of roughly independent SNPs. We then compared the PLINK IBD estimates with the kinship estimates that were entirely based on the pedigree structure, as estimated by SOLAR [20], and identified those monkeys frequently involved in discordant pairs. We also checked the genotype concordance between the WGS SNP set and a dataset of SNP genotypes, generated previously, using independent methods, for fine mapping of QTL in the VRC [4]; for this comparison we used 415 monkeys for which we had genotypes from both the WGS SNPs and the previously generated SNPs.
At the start of the WGS studies, parentage was known for monkeys born prior to 2008; this information derived from microsatellite-based genotyping [3] together with observational data. For monkeys born during or after 2008 (N=174) we attempted to identify parentage using the pairwise IBD relationships (estimated with WGS data as described above) between these monkeys and all others.
To assign parentage we required: 1) identification of levels of IBD sharing expected between parent-offspring, with at most one monkey of each gender demonstrating such a level of sharing to monkeys of unknown parentage; 2) for mothers, concordance between the WGS IBD information with data from colony records on observed mother-infant behavior; 3) for fathers, concordance between WGS IBD information and data from colony records indicating a set of possible fathers (based on their sharing housing with the mothers and infants in question); 4) a plausible age difference between putative parents and offspring (within a range of 4–15 years).
Stage 4: Generating the final SNP mapping sets and multipoint IBD files
After removing data from samples suspected of contamination or mislabeling, and updating the pedigree based on newly identified parent-offspring relationships, we repeated the genotype call refining (Beagle + TrioCaller) step, described in Stage 2. We then applied the following filters to generate our final mapping sets: 1) removed SNPs with >=5 Mendel errors between parents and offspring; 2) removed SNPs with excess heterozygosity (heterozygous fraction>0.6); 3) marked the remaining (sporadic) parent-offspring Mendelian-error genotypes as missing; 4) removed SNPs with missing rate >5 %.
We then thinned the SNP dataset to generate two genetic mapping marker sets, one appropriate for association (retaining SNPs with pairwise r2<=0.9) and one appropriate for linkage mapping (retaining SNPs with pairwise r2<=0.4). In conducting the pruning for both marker sets we used PLINK, with window size=50 and shift size =20. Lastly we used PedCheck [21] to detect SNPs that showed pedigree-wide Mendel errors and removed them from both mapping sets. We deposited in NCBI the WGS-based genotype data from 721 VRC vervets (used to construct the 500K and 150K SNP mapping sets); these data are publically available under BioProject PRJEB7923, and browsable at [22].
To further enhance the utility of the SNP sets, we used LOKI [23] to construct multipoint identity by descent (MIBD) files. These files summarize the probability that, at a particular location in the vervet genome, a pair of sequenced monkeys share genotypes IBD. These estimates are the basis for multipoint variance component linkage analysis in SOLAR, the most widely used approach for pedigree-based QTL analysis. In order to evaluate MIBD, which is very computationally intensive, we used LD pruning to obtain a subset of SNPs from the linkage mapping SNP set. This reduced set of markers was adequate for MIBD evaluation, as the close connections among the inbred pedigree has led to IBD sharing over long chromosomal segments.