Comparison of village dog and wolf genomes highlights the role of the neural crest in dog domestication

Background Domesticated from gray wolves between 10 and 40 kya in Eurasia, dogs display a vast array of phenotypes that differ from their ancestors, yet mirror other domesticated animal species, a phenomenon known as the domestication syndrome. Here, we use signatures persisting in dog genomes to identify genes and pathways possibly altered by the selective pressures of domestication. Results Whole-genome SNP analyses of 43 globally distributed village dogs and 10 wolves differentiated signatures resulting from domestication rather than breed formation. We identified 246 candidate domestication regions containing 10.8 Mb of genome sequence and 429 genes. The regions share haplotypes with ancient dogs, suggesting that the detected signals are not the result of recent selection. Gene enrichments highlight numerous genes linked to neural crest and central nervous system development as well as neurological function. Read depth analysis suggests that copy number variation played a minor role in dog domestication. Conclusions Our results identify genes that act early in embryogenesis and can confer phenotypes distinguishing domesticated dogs from wolves, such as tameness, smaller jaws, floppy ears, and diminished craniofacial development as the targets of selection during domestication. These differences reflect the phenotypes of the domestication syndrome, which can be explained by alterations in the migration or activity of neural crest cells during development. We propose that initial selection during early dog domestication was for behavior, a trait influenced by genes which act in the neural crest, which secondarily gave rise to the phenotypes of modern dogs. Electronic supplementary material The online version of this article (10.1186/s12915-018-0535-2) contains supplementary material, which is available to authorized users.


Background
The process of animal domestication by humans was complex and multi-staged, resulting in the disparate appearances and behaviors of domesticates relative to their wild ancestors [1][2][3].In 1868, Darwin noted that numerous traits are shared among domesticated animal species, an observation that has since been classified as the "Domestication Syndrome" (DS) [4].DS is a phenomenon where diverse phenotypes are shared among phylogenetically distinct domesticated species but absent in their wild ancestors.Such traits include increased tameness, shorter muzzles/snouts, smaller teeth, more frequent estrous cycles, floppy ears, reduced brain size, depigmentation of skin or fur, and loss of hair.
During the domestication process, the most desired traits are subject to selection.This selection process may result in detectable genetic signatures such as alterations in allele frequencies [5][6][7][8][9][10][11], amino acid substitution patterns [12][13][14], and linkage disequilibrium patterns [15,16].Though numerous genome selection scans have been performed within a variety of domesticated animal taxa [5][6][7][8][9][10][11]17], no single "domestication gene" shared across domesticates has been identified [18,19].However, in the context of DS, this is not unexpected given more than a dozen diverse behavioral and complex physical traits fall under the syndrome.Based on the specific range of traits associated with DS, it is likely that numerous genes with pleiotropic effects likely contribute through mechanisms which act early in organismal development [18,19].For this reason, the putative role of the neural crest in domestication has gained traction.
While embryonic neural crest cells are the progenitors of most of the adult vertebrate body, these cells also influence behavior.Functions of the adrenal and pituitary systems, also derived from neural crest cells, influence aggression and the "fight or flight" behavioral reactions, and have found to be differentially affected in domesticates [20].
No domestic animal has shared more of its evolutionary history in direct contact with humans than the dog (Canis lupus familiaris), living alongside humans for tens of thousands of years since domestication from its ancestor the gray wolf (Canis lupus).

Comparison using F ST outlier approaches
We adapted methods from previous selection scan studies in dogs to determine the impact of sample choice (i.e.breed versus village dogs) on the detection of selective sweeps associated with early domestication pressures.We employed sliding window selection scan methodologies similar to those implemented in [26,27] to localize genomic intervals with unusual levels of allele frequency differentiation between the village dog and wolf populations.First, through ADMIXTURE [39] and identity-by-state (IBS) analyses, we identified a collection of 43 village dog and 10 gray wolf unrelated samples (Figure 1a) that have less than 5% dog-wolf admixed ancestry (see Methods).
Principal component analysis illustrates the genetic separation between village dogs and wolves, and largely reflects the geographic distribution of these populations (Figure 1b and 1c).Next, we calculated average F ST values in overlapping 200 kb sliding windows with a step-size of 50 kb across the genome.As in [26,27], we performed a Ztransformation to normalize the resulting values and identified windows with a ZF ST score greater than 5 (autosomes) or 3 (X chromosome) as candidate domestication regions.Following merging, this outlier procedure identified 31 CDRs encompassing 12.3 Mb of sequence (Additional File 1: Table S1).As in previously studies, a 550 kb region on chromosome 6 (46.80-47.35Mb) that contains the pancreatic amylase 2B (AMY2B) and RNA Binding Region Containing 3 (RNPC3) genes had the highest observed average ZF ST score (ZF ST = 7.67).
Only 15 of these 31 regions intersect with those reported in [26] and [27] (Figure 2a).To further explore this discrepancy, we visually assessed whether the dog or wolf haplotype is present at the loci reported in these earlier studies in 46 additional canine .CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.It is made available under The copyright holder for this preprint (which was not this version posted January 12, 2018.; https://doi.org/10.1101/118794doi: bioRxiv preprint 6 samples, including three ancient European dogs ranging in age from 5,000-7000 years old (see Methods; [21,34]).Likely due to the absence of village dogs in their study, some Axelsson loci appear to contain selective sweeps associated with breed formation, as evidenced by the presence of the wild haplotype in ancient and village dogs (example in Figure 2b).Although all autosomal sweeps identified by [8] intersected with CDRs from our study, seven X chromosome Cagan and Blass windows did not meet the thresholds of significance from our SNP sets (example in Additional File 2: Figure S1).We performed F ST scans and Z transformations for windows on autosomes and X chromosome together, which may falsely inflate F ST signals on the X due to smaller effective population sizes and correspondingly higher expected levels of genetic drift on the X chromosome.More detailed analysis of the loci highlighted in these two studies will be elaborated in the following section.

Refined identification of differentiated loci using demographic models and ancient genomes
Our results indicate that the use of village dogs in selection scans identifies novel candidate domestication loci.However, we further refined these F ST -based domestication scans to account for several shortcomings.First, rather than setting an empirical threshold at a ZF ST score of 5, we created a neutral null model that captures key aspects of dog and wolf demographic history (Additional File 2: Figure S2; Additional File 3: Table S2; [34,40]).We identified 443 autosomal sliding windows with F ST values that exceed the 99 th percentile of the neutral simulations (F ST = 0.308; Additional File 2: Figure S3a; Additional File 2: Figure S4a).Second, reasoning that a true domestication sweep will be largely fixed among extant dogs with no recent wolf admixture, we calculated pooled heterozygosity (H P ) in village dogs within the same window boundaries, and retained windows with a H P lower than the 0.1 th percentile observed in our simulations (Additional File 2: Figure S4b).This heterozygosity filter removed 199 of the 443 windows (Additional File 2: Figure S3b).Finally, we excluded regions where the putatively selected haplotype is not found in ancient dog samples.To do this, we calculated the difference in dog H P (ΔH P ) with and without the inclusion of two ancient dog samples HXH, a 7ky old dog from Herxheim, Germany ( [34]) and NGD, .CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.It is made available under The copyright holder for this preprint (which was not this version posted January 12, 2018.; https://doi.org/10.1101/118794doi: bioRxiv preprint 7 a 5ky old dog from Newgrange, Ireland ( [21,34]); see Methods).Windows with ΔH P greater than the 5 th percentile of all windows genome-wide (ΔH P = -0.0036)were removed (Additional File 2: Figure S3c; Additional File 2: Figure S4c and d).Remaining overlapping windows were merged, resulting in 58 autosomal CDRs that encompass 18.65 Mbp of the genome and are within 50kb of 248 Ensembl gene models (Additional File 4: Table S3; Figure 3b).

Assessment of previously identified CDRs
We applied the same filtration parameters to putative dog domestication sweeps identified on autosomes in Axelsson et al. (N = 30; [26]) and Cagan and Blass (N = 5; [27]) (Additional File 2: Figure S5a and b).Since window coordinates of these studies may not precisely match our own, we selected the maximum F ST value per locus from our village dog and wolf data.We then removed any locus with F ST , H P , and ΔH P levels not passing our thresholds.Following these three filtration steps only 14 Axelsson, and 4 Cagan and Blass loci remained.Though few autosomal loci were reported in Cagan and Blass, 4 of the 5 loci passed and all have a corresponding sweep in either Axelsson or this study.These comparisons reinforce that sampling only or mostly breed dogs identifies many candidate sweeps that may have arisen post-domestication or as a result of breed formation.We also assessed the 349 loci identified by [29] by various statistics, and found that only 41 loci remained after similar filtrations (Additional File 2: Figure S5c)

Increased resolution for domestication scans using the XP-CLR statistic
We further refined our search for domestication sweeps in village dogs using XP-CLR, a statistic developed to identify loci under selection based on patterns of correlated allele frequency differences between two populations [41].Since we are searching for regions under selection in the dog genome, wolves were set as our reference population and XP-CLR was run on both simulated and real SNP datasets with a spacing of 2 kb, and a window size of 50 kb.Average XP-CLR values were calculated within 25 kb sliding windows (10 kb step size) for both datasets, and we retained 889 windows with scores greater than the 99th percentile from simulations (XP-CLR = 19.78;Additional File 2: . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.It is made available under The copyright holder for this preprint (which was not this version posted January 12, 2018.; https://doi.org/10.1101/118794doi: bioRxiv preprint 8 Figure S6a).Using methods similar to those employed for F ST scans, windows with village dog H P values less than the 0.1 st simulation percentile (H P = 0.0598) or where the ancient dog samples carried a different haplotype (ΔH P filtration threshold at 5 th percentile = -0.0066)were eliminated (Additional File 2: Figure S6b-d; Additional File 2: Figure S3c).This resulted in 598 autosomal windows which we merged into 246 candidate loci, encompassing 10.81 Mb of genomic sequence (Figure 3a; Additional File 5: Table S4).These windows are located within 50 kb of 178 Ensembl gene models, and no SNPs with high F ST within the intervals had deleterious effects on coding sequence, based on Variant Effect Predictor (Additional File 6: Table S5; [42]).
Finally, the vast majority of the filtered XP-CLR sweeps (204/246) are novel to this study, 21 of which intersect F ST loci from the non-outlier approach.

Gene enrichment among 246 refined candidate domestication regions
We sought to identify gene sets and pathways enriched within our candidate domestication loci.Because of the increased resolution provided by XP-CLR (i.e.smaller window size), we focused the following gene analyses on the autosomal XP-CLR loci.Of these regions, 78% were within 50 kb of at least one gene.Based on 1,000 randomized permutations (see Methods), we found that the XP-CLR sweeps are not more likely to localize near genes than expected (p = 0.07), though the loci are near a greater total number of genes than random permutations (p = 0.003; Additional File 2: Figure S7a and b).Finally, gene set or pathway enrichment results can be influenced by gene size, as small genes are less likely to intersect with a sweep, potentially biasing toward enrichment of pathways that contain long genes.We observed that our sweeps contain genes of the similar average length as found in the randomized set (p > 0.05; Additional File 2: Figure S7c).Gene ontologies were assigned to each dog Ensembl gene model using the BLAST2GO pipeline (see Methods).Enrichment tests were performed by the topGO R package [43] using the intersecting gene models as the test set, and the autosomal dog genes as the reference.In total, 292 significantly enriched GO categories were identified (p < 0.05) using the Parent-Child model of Fisher's Exact Test [44].To .CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.It is made available under The copyright holder for this preprint (which was not this version posted January 12, 2018.; https://doi.org/10.1101/118794doi: bioRxiv preprint 9 eliminate false positives, we also performed GO assignments and topGO enrichment tests for genes intersecting the randomized permutations and eliminated any XP-CLR GO category that was observed in more than 50 permuted enrichments (rate > 0.05) was excluded.After this filtration, 243 GO categories remained (Additional File 7: Table S6), 17 consisting of twenty or more XP-CLR loci.Lessening bias from singletons, we focused on the 174 GO terms represented by more than one XP-CLR locus.The top categories include GO categories associated with DNA replication (post-replication repair), transcriptional activity (STAT cascade, transcriptionally active chromatin), cellular localization, signaling, insulin secretion, and organismal development (midbody, post-embryonic animal organ morphogenesis).Additional biological pathways are highlighted below.

Candidate Genes Influencing Retinoic Acid Signaling
Retinoic acid (RA) is a signaling molecule that has numerous critical roles in development at the embryonic level, continuing into adult stages with roles including maintaining stem cell proliferation, tissue regeneration, and regulation of circadian rhythm [45,46].The highest scoring XP-CLR locus centers upon RAI1 (Retinoic Acid-Induced 1; Figure 4), a gene with numerous developmental functions in the RA pathway and the gene responsible for Smith-Magenis and Potocki-Lupski Syndromes in humans ( [47,48]).Emphasizing the possible role of altered retinoic acid signaling in dog domestication, regulation of RA receptors were significantly enriched (GO:0048385; p = 0.0496).This category includes NR2C1 (Nuclear Receptor Subfamily 2 Group C Member 1), essential for the development of early retina cells through regulation of early transcription factors that govern retinal progenitor cells such as RA receptors [49].The second gene encodes Calreticulin, a protein involved in inhibition of both androgen and RA transcriptional activities [49,50].Additionally, Ncor2 (Nuclear Receptor Corepressor 2), found in In XP-CLR_209, increases cell sensitivity to RA when knocked out in mice [51].
. CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.It is made available under The copyright holder for this preprint (which was not this version posted January 12, 2018.; https://doi.org/10.1101/118794doi: bioRxiv preprint 10

Candidate Genes Regulating Brain Development and Behavior
An abundance of genes and ontologies associated with neurological function, neurotransmission, and behavior were observed in our XP-CLR loci.GO categories such as neurotrophin binding, regulation of neurotransmitter levels, presynapse, as well as spines of both the neuron and dendrite were significantly enriched.Additionally, twelve genes, each from distinct XP-CLR loci, are linked to presynapse function (GO:0098793; p = 0.0143).Within these and additional loci, we observe numerous genes associated with neurotransmission.Representing two behavior-associated neurotransmitter pathways, the serotonin transporter SLC6A4 and dopamine signaling members GNAQ and ADCY6 (GO:0007212; p = 0.044) are also present.Genes associated with glutamate, the excitatory neurotransmitter, include DGKI (Diacylglycerol kinase iota; ranked 6th by XP-CLR), which regulates presynaptic release in glutamate receptors [52], and GRIK3 (Glutamate receptor, ionotropic, kainate 3), a glutamate receptor [53].Finally, UNC13B (Unc-13 Homolog B) is essential for competence of glutamatergic synaptic vesicles [54], and CACNA1A (Calcium Voltage-Gated Channel Subunit Alpha1 A) influences glutamatergic synaptic transmission [55].
In contrast to glutamate, GABA is the nervous system's inhibitory neurotransmitter, and has been linked to the response to and memory of fear [53,56,57].Genes in our XP-CLR loci relating to GABA include one of the two mammalian GABA biosynthetic enzymes GAD2 (or GAD65; ranked 20th), the GABA receptor GABRA4, auxiliary subunit of GABA-B receptors KCTD12 ( [52,58]), and the GABA inhibitor osteocalcin (or BGLAP; [59]).Lastly, T-Cell Leukemia Homeobox 3 (TLX3) is a key switch between glutamatergic and GABAergic cell fates [60].

Candidate Genes with Core DNA and RNA Functions
Pathways involved in core DNA and RNA processes are also well represented within the XP-CLR regions.Highly significant GO categories relate to the STAT (signal transducers and activators of transcription) cascade and its regulation (3 rd , 6 th , and 7 th top categories), represented by ten genes from nine unique XP-CLR loci including Interleukins (IL22, IL26), an Interleukin receptor (IL31RA), SOCS3 (Suppressor Of .CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.It is made available under The copyright holder for this preprint (which was not this version posted January 12, 2018.; https://doi.org/10.1101/118794doi: bioRxiv preprint 11 Cytokine Signaling 3), and CYP1B1 (Cytochrome P450 1B1).In addition to the STAT transcriptional activation pathway, genes involved in maintaining transcriptionally active chromatin, positively regulating transcription from RNA polymerase III promoter, and cryptic unstable transcript catabolism (CUT) are also enriched in our GO terms.

Candidate Genes Involved In Cell Cycle Control
Nine genes involved in the regulation of cell cycle progression were located in XP-CLR loci (GO:1901990; p = 0.039), including BIRC5 (or survivin) which is an apoptosis inhibitor [67,68].Other enriched categories included postreplication repair (GO:0006301; p=0.013) and cell maturation (GO:0048469; p = 0.01299).Perturbations in any of these pathways could influence numerous biological functions such as cell proliferation, cell fate/death, and repair.

Survey of copy-number variation between dogs and wolves
Copy-number variants have also been associated with population-specific selection and domestication in a number of species [5,69,70].Since regions showing extensive copynumber variation may not be uniquely localized in the genome reference and may have a deficit of SNPs passing our coverage thresholds, we directly estimated copy number along the reference assembly and searched for regions of extreme copy number differences (see Methods).Using V ST, a statistic analogous to F ST [69], we identified 67 regions of extreme copy number difference between village dogs and wolves which are 12 within 50 kb of 89 unique genes ("Variant Candidate Domestication Regions" or VCDRs; Additional File 8: Table S7; see Additional File 9: Note 3).There was no overlap of VCDRs with regions identified through F ST or XP-CLR.
Relative to randomly permuted intervals, the 67 VCDRs are more likely to be near genes (p < 0.01; Additional File 2: Figure S8a) but do not encompass more total genes than expected (p > 0.05; Additional File 2: Figure S8b).With methods implemented for XP-CLR loci, gene ontology enrichments and permutations were performed for genes within 50 kb of the VCDRs.GO enrichment analysis identified 111 significantly enriched categories (p < 0.05 using Parent-Child statistics; [43]).However, only 104 categories remained after filtration of GO categories found in >5% of random permutations (Additional File 10: Table S8).Five enriched terms were supported by more than two VCDRs including regulation of circadian rhythm (N=2 VCDRs), lipid droplet (N=3), calcium and calmodulin responsive adenylate (N=2), cyclase activity (N=2), steroid binding (N=2) and beta-catenin binding (N=2).

Refined analysis of a large-scale structural variant located at the AMY2B locus
The top VCDR encompasses the AMY2B gene, which at increased copy number confers greater starch metabolism efficiency due to higher pancreatic amylase enzyme levels [5,38].Quantitative PCR results have suggested an ancient origin for the AMY2B copy number expansion, as 7ky old Romanian dogs exhibit elevated AMY2B copy number [26,37].However, read-depth analysis shows that the AMY2B tandem expansion is absent in 5-7ky old ancient European dogs [34].However, read-depth profiles identified two large duplications, one of 1.9 Mb and the other of 2.0 Mb, that encompass AMY2B (Additional File 2: Figure S9).We quantified copy number at AMY2B itself and regions which discriminate the two segmental duplications in 90 dogs using digital droplet PCR (ddPCR).Copy number estimated through read depth strongly correlated with estimates from ddPCR (Additional File 2: Figure S10) confirming the presence of standing copy number variation of AMY2B in dogs (range of 2n AMY2B = 2-18) and distinguishing the two large-scale duplications (Additional File 2: Figure S11).The extreme AMY2B copy number expansion appears to be independent of the large-scale .CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.It is made available under The copyright holder for this preprint (which was not this version posted January 12, 2018.; https://doi.org/10.1101/118794doi: bioRxiv preprint 13 duplications, as ddPCR results show that some dogs without the large duplications still have very high AMY2B copy number (Additional File 2: Figure S11).Read-depth patterns at the duplication breakpoints indicated that NGD, the ancient Irish dog, harbored the 2.0 Mb duplication resulting in increased AMY2B copy number.

Discussion
Genetic and archaeological data indicate that the dog was first domesticated from Eurasian gray wolves well over 10 kya [25,34,40,71].Evidence suggests that the domestication process was complex and may have spanned thousands of years [3,71].
Through multiple analyses, we have identified regions that are strongly differentiated between modern village dogs and wolves and which may represent targets of selection during domestication.Our approach differs from previous studies in several ways including the use of village dogs rather than breed dogs, using neutral simulations to set statistical cutoffs, and filtering candidate loci based on ancient dog DNA data.Most (83%) of the 246 candidate domestication regions we identified are novel to our study, which we largely attribute to reduced signals associated with post-domestication breed formation.We argue that swept haplotypes identified in modern village dogs and also present in Neolithic dogs more likely represent signals of ancient selection events.
While the use of neutral simulations accounts for genetic diversity in both wild and domestic sampled populations, and better controls false positive rates than arbitrary empirical thresholds [29,72], several limitations are still apparent in our approach.The demographic model we used does not capture all aspects of dog history, does not include the X chromosome, and does not fit all aspects of the observed data equally well.This likely represents unaccounted for features of the data, such as unmodeled population structure, as well as technical issues such as reduced ascertainment of low frequency alleles due to sequencing depth.Although the inclusion of ancient samples allows for the removal of candidate sweeps that are unique to modern dogs, this approach is limited by the narrow temporal (5-7 kya) and geographic (restricted to Europe) sampling offered by the available data.Even though most selected alleles likely preexisted in the ancestral wolf population, our approach identifies regions where modern dogs all share the same haplotype.However, even when selection acts on preexisting mutation a single haplotype often reaches fixation [73], consistent with the variation patterns we identify across village dog populations.
Analysis of genes near the 246 loci identified using XP-CLR identified numerous overrepresented ontologies including developmental, metabolic, signaling, and neurological pathways (Additional File 7: Table S6).Importantly, our gene annotations were obtained directly through established BLAST2GO pipelines [74], rather than assignment of function via homologous human genes, as has been employed in earlier dog domestication studies [26,28,29].We additionally performed permutations to remove functional categories likely to be observed by chance.Importantly, the genomic positions of the genes in most enriched categories are not co-localized, implying that the observed enrichment is not an artifact of the clustering of genes with similar functions.

The role of the neural crest in dog domestication
Our CDRs include 52 genes also identified in analysis of other domesticated or selfdomesticated animals [9,11,17,[75][76][77][78][79], including four genes (RNPC3, CUEDC1, GBA2, NPR2) in our top 20 XP-CLR loci.No gene was found in more than three species, consistent with the hypothesis that no single domestication gene exists [18].Although the overlap of specific genes across species is modest, there are many enriched gene pathways and ontologies shared in domesticates including neurological and nervous system development, behavior, reproduction, metabolism, and pigmentation [10,11,17,75,80,81].We attribute these patterns to the domestication syndrome (DS), a phenomenon where diverse traits, manifested in vastly different anatomical zones, appear seemingly disconnected, yet are maintained across domesticates.Two possible modes of action could generate the DS phenotypes while still displaying the genomewide distribution of sweeps.The first would require independent selection events for distinct traits at numerous loci.Alternatively, selection could have acted on considerably fewer genes that are members of early-acting developmental pathways with broad phenotypic effects.
. CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.It is made available under The copyright holder for this preprint (which was not this version posted January 12, 2018.; https://doi.org/10.1101/118794doi: bioRxiv preprint 15 For these reasons, the role of the neural crest in animal domestication has gained support from researchers over recent years [18,82].In 2014, Wilkins et al. [18]established that the vast array of phenotypes displayed in the animal DS mirror those exhibited in mild human neurocristopathies, whose pathology stems from aberrant differentiation, division, survival, and altered migration of neural crest cells (NCCs).These cells are multipotent, transient, embryonic stem cells that are initially located at the crest (or dorsal border) of the neural tube.The initiation and regulation of neural crest development is a multi-stage process requiring the actions of many earlyexpressed genes including the Fibroblast Growth Factor (Fgf), Bone Morphogenic Protein (Bmp), Wingless (Wnt), and Zic gene families [83].Several of the genes identified in our analysis are involved in this transition including members of the Fgf (Fgf1) family as well as a transcription factor (TCF4; [84]), inhibitors (RRM2; NPHP3; [85,86]) and regulators (LGR5; [87]) of the Wnt signaling pathways.
Following induction, NCCs migrate along defined pathways to various sites in the developing embryo.Assignment of identity and the determination of migration routes relies on positional information provided by external signaling cues [88,89].KCTD12, CLIC4, PAK1, NCOR2, DOCK2, and EXOC7 are examples of such genes that are linked to the determination of symmetry, polarity, and/or axis specification [90][91][92][93][94].
Together, our results suggest that early selection may have acted on genes essential to the initiation of the neural crest and the definition of migration routes for NCCs.

NCC-Derived Tissues Linked to Domestication Syndrome Phenotypes
Once in their final destinations, NCC further differentiate as the precursors to many tissues in the developing embryo.Most of the head, for example, arises from NCCs including craniofacial bones, cartilage, and teeth [95,96].Ancient dog remains indicate that body size, snout lengths, and cranial proportions of dogs considerably decreased compared to the wolf ancestral state following early domestication [97].Further, these remains indicate jaw size reduction also occurred, as evidenced by tooth crowding [97].
Such alterations are consistent with the DS and implicate aberrant NCC migration since .CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.It is made available under The copyright holder for this preprint (which was not this version posted January 12, 2018.; https://doi.org/10.1101/118794doi: bioRxiv preprint 16 decreases in the number of NCCs in facial primordia are directly correlated with reductions in mid-face and jaw sizes [18,98].Genes associated with both craniofacial and tooth development in vertebrates are found in our candidate loci including SCUBE1, which is essential in craniofacial development of mice, and SATB2, which has roles in patterning of the developing branchial arches, palate fusion, and regulation of HOXa2 in the developing neural crest [99][100][101].Lastly, when knocked out in mice, Bicoid-related homeodomain factor PITX1 not only affected hindlimb growth, but also displayed craniofacial abnormalities such as cleft palate and branchial arch defects [102] and influences vertebrate tooth development [103]. Insufficient cartilage, a NCC-derived tissue [96] that consists of chondrocytes and collagen, in the outer ear of humans results in a drooping ear phenotype [104] linked to numerous NC-associated neurocristopathies (e.g.Treacher Collins, Mowat-Wilson, etc.).Analogously, compared to the pricked ears of wolves, dogs predominantly have "floppy" ears [105], a hallmark feature of domesticates [18].SERPINH1, a collagenbinding protein, is embryonically lethal when ablated in mice [106], and appears to be required for chrondrocyte maturation [107].Alterations of activity by genes such as SERPINH1 and those regulating NCC migration may have reduced the numbers of NCCs in dog ears, contributing to the floppy phenotype [18].

Mis-regulation of gene expression may contribute to DS phenotypes
Similar to other domestication scans [6,9,19], we did not find SNPs deleteriously altering protein sequence in our predicted sweeps, indicating that gene loss did not have a significant role in dog domestication.Instead, we hypothesize that alterations in gene regulatory pathways or the regulation of transcriptional activity could contribute to broad DS phenotypes.In addition to over-representations of genes influencing transcriptional control (STAT cascade) and maintenance of transcriptionally active chromatin, we identified two components of the minor spliceosome; RNPC3 and Sf3b1.RNPC3, which affects early development and is linked to dwarfism (Isolated Growth Hormone Deficiency; [108]), is also under selection in cats and humans [17,77].Absence of Sf3b1 disrupts proper NCC specification, survival and migration [109].A further example of the .CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.It is made available under The copyright holder for this preprint (which was not role of splicing in NC development is that mutations in U4atac, a U12 snRNA subunit gene missing in the current dog annotation, causes Taybi-Lindner syndrome (TALS) in humans.Phenotypes of this syndrome resemble those of the DS including craniofacial, brain, and skeletal abnormalities [110].Thus, proper splicing, particularly for transcripts processed by the minor spliceosome, is required for proper NC function and development.

Genes associated with neurological signaling, circadian rhythms, and behavior
Tameness or reduced fear toward humans was likely the earliest trait selected for by humans during domestication [3,111,112].Recapitulating such selection, numerous physiological and morphological characteristics, including DS phenotypes (i.e.floppy ears, altered craniofacial proportions, and unseasonal timing for mating), appeared within twenty generations when researchers selected only for tameness in a silver fox breeding population [1,113].As the progenitors for the adrenal medulla, which produces hormones associated with the "fight-or-flight" response, hypofunction of NCCs can lead to changes in the tameness of animals [18].The link between tameness and the NC suggests that changes in neural crest development could have arisen first, either through direct selection by humans for desired behaviors or via the "self-domestication" [114,115] of wolves that were more docile around humans.

Numerous candidate genes influencing neurological function and behavioral responses
were enriched in XP-CLR loci including genes in the dopamine, serotonin, glutamate, and GABA neurotransmission pathways.These, coupled with enriched GO categories relating to neuronal projections (e.g.neuron and dendrite spines), suggest that neuronal signaling may have been differentially impacted in dogs through domestication, in agreement with Li et al. [116].
Our top XP-CLR locus was located within RAI1, the gene for Smith-Magenis syndrome (SMS; [117]), a disorder that is associated with craniofacial and skeletal deformations, aggression, developmental delays, altered circadian rhythms, and intellectual disabilities [118].A recent study of breed dogs and wolves by vonHoldt et al. [119] [120].Williams-Beuren and Smith-Magenis syndromes each display multiple features associated with improper NCC development [118,121].In Xenopus, disruption of the transcription factors RAI1 and WSTF (Williams-Beuren Syndrome Transcription Factor, a gene also disrupted in Williams-Beuren syndrome) negatively impacts proper NCC migration, recapitulating the craniofacial defects associated with the syndromes [122,123].Furthermore, both syndromes involve perturbations in sleep patterns [124][125][126][127], and RAI1 has been shown to regulate the circadian rhythm pathway [126].Alterations of sleep patterns occurred in dogs as they shifted from the ancestral nocturnal state of wolves, to that of the diurnal lifestyle also exhibited by humans.In domesticated silver foxes selected for tameness, levels of circadian rhythm determinants (e.g.melatonin and serotonin) were significantly altered compared to wild foxes [128][129][130].This suggests possible overlap between genes influencing behavior, circadian rhythms, and the Domestication Syndrome.

Conclusions
By comparing village dogs and wolves, we identified 246 candidate domestication regions in the dog genome.Analysis of gene annotations in these regions suggests that perturbation of crucial neural crest signaling pathways results in the broad phenotypes associated with the Domestication Syndrome.Additionally, these findings link transcriptional regulation and splicing to alterations in cell differentiation, migration, and neural crest development.Altogether, we conclude that while primary selection during domestication likely targeted tameness, genes that contribute to determination of this behavioral change are also involved in critical, far-reaching pathways that conferred drastic phenotypic changes in dogs relative to their wild counterparts.

Sample Processing and Population Structure Analysis
Genomes of canids (Additional File 11: Table S9) were processed using the pipeline outlined in [34] to produce a data set of single nucleotide polymorphisms (SNPs) using 19 GATK [131].Thirty-seven breed dogs, 45 village dogs, and 12 wolves were selected from the samples described in Botigue et al. [34], and ADMIXTURE [132] was utilized to estimate the levels of wolf-dog admixture within this subset.To account for LD, the data was thinned with PLINK v1.07 (--indep-pairwise 50 10 0.1; [133]), where SNPs with an R 2 value over 0.1 were removed in 50kb windows, sliding 10 sites at a time.The remaining 1,030,234 SNPs were used in five independent ADMIXTURE runs using different seeds, for up to five ancestral populations (K = 1-5).K = 3 had the lowest average cross validation error (0.0373) from the five runs, and was therefore the best fit for the data.To eliminate noise in subsequent analyses, we removed all village dogs with greater than 5% wolf ancestry and wolves with greater than 5% dog ancestry.Fiftyfour samples remained following this filtration.
Following elimination of admixed samples, we called SNPs in 43 village dogs and 11 gray wolves using GATK (v.3.4-46; [131]).Using the GATK VQSR procedure, we identified a high quality variant set such that 99% of positions on the Illumina canine HD array were retained.VQSR filtration was performed separately for the autosomes + chrX pseudoautosomal region (PAR) and the non-PAR region.SNPs within 5 bp of an indel identified by GATK were also removed.We further excluded sites with missing genotype calls in any sample, triallelic sites, and X-nonPAR positions where any male sample was called as heterozygous.The final SNP set contained 7,657,272 sites.
Using these SNPs, we removed samples that exhibited over 30% relatedness following identity by state (IBS) analysis with PLINK v1.90 (--min 0.05; [133]).Only one sample (mxb), was removed from the sample set, a sample known to be related to another mexican wolf in the dataset.Principal component analyses were completed on the remaining 53 samples (43 dogs and 10 wolves) using smartpca, a component of Eigensoft package version 3.0 [134] after randomly thinning the total SNP set to 500,000 sites using PLINK v.1.90[135].Once PCA confirmed clear genetic distinctions between these dogs and wolves, this final sample set was used for subsequent analyses.The SNP set was further filtered for the selection scans to remove rare alleles 20 (minor allele frequencies < 3 out of possible 106 alleles, or 0.028).Finally, village dog and wolf allele frequencies were calculated separately using VCFtools ( [136].

Demographic Model and Simulations
Simulations of dog and wolf demographic history were performed using msprime v.0.4.0 [137].For each autosome, 75 independent simulations were performed using independent random seeds and a pedigree based genetic map [138].A mutation rate of 4 x10 -9 per site per generation with a generation time of 3 years was assumed.The 53 samples were modeled as coming from 10 lineages with population histories adapted from [34,40] (Additional File 2: Figure S2; Additional File 3: Table S2).The simulation is designed to capture key aspects impacting dog and wolf diversity, rather than a definitive depiction of their demography.Resulting simulated SNP sets were filtered for minor allele frequency and randomly thinned to have the same number of SNPs per chromosome as the real SNP datasets used in F ST , XP-CLR, and H P calculations.

F ST Selection Scans
Dog and wolf allele counts generated above were used to calculate the fixation index (F ST ) using the Hudson estimator derived in [139]with the formula: ) where is the allele frequency in population x, and is the number of individuals in population x.
With this equation, the X chromosome could be included in F ST calculations.A custom script calculated the per site F ST across the genome for both the real and 75 simulated SNP sets.Due to differences in effective population size and corresponding expected levels of genetic drift, analyses were performed separately for the X-non-PAR.Ratio of averages for the resulting F ST values were calculated in 200kb sliding windows with 50kb step sizes, and we required each window to contain at least 10 SNPs.Additionally, we calculated per site F ST for each SNP that did not have missing data in any sample.
F ST loci filtration was completed differently for the outlier and non-outlier approach.For the outlier F ST approach, the windows were Z-transformed and only windows with Z-scores ≥ 5 standard deviations were deemed significant for autosomal and X-PAR loci, and ≥ 3 for the X-NonPAR.Significance thresholds for the non-outlier approach were determined as the 99th percentile from F ST score distributions from the simulated genomes.Overlapping windows passing these thresholds were merged.

Pooled Heterozygosity (H P ) and ΔH P Calculations
Per window, dog allele frequencies were used to calculate pooled heterozygosity (H P ) using the following formula from [6]: 2Σn MAJ Σn MIN /(Σn MAJ + Σn MIN ) 2 , where Σn MAJ is the sum of major and Σn MIN minor dog alleles, respectively, for all sites in the window.
Significance thresholds for window filtration was set as the 0.1 th percentile of the H P distribution from the simulated genomes.The change in H P (or ΔH P ) was calculated as the difference in ΔH P with and without the inclusion of the two ancient dog samples (HXH and NGD).The 5 ky old German dog (CTC) was not included in this analysis due to known wolf admixture [34].Windows with ΔH P greater than the 5 th percentile observed genome wide were removed.

XP-CLR Selection Scans
XP-CLR was calculated using dog and wolf allele frequencies at sites described above [41].This analysis requires separate genotype files for each population, and a single SNP file with positions of each SNP and their genetic distance (in Morgans), which were determined through linear extrapolation from the pedigree-based recombination map from [138].Wolves were set as the reference population, and XP-CLR was run on both the simulated and real SNP sets with a grid size of 2 kb, and a window sizes of 50 kb.
Windows that did not return a value (failed) or did not have at least five grids were removed.Average XP-CLR scores were calculated in 25 kb windows (step size = 10kb).
Filtration of real windows with averages less than the 99 th percentile of averaged simulation scores was performed.All remaining adjacent windows were merged if they were within 50 kb distance (i.e. one sliding window apart).

Visualization of sweeps
Forty-six additional canines (e.g.dog breeds, jackals, coyotes; Additional File 11: Table S9) were genotyped at candidate loci identified in this study, as well as those from [26,27], [27], and [29] using autosomal SNPs previously called in [34].SNPs within CDRs of interest were extracted from the SNP dataset using the PLINK make-bed tool with no missing data filter.Per sample, each SNP was classified as 0/0, 0/1, or 1/1 at all loci (1 representing the non-reference allele), and this genotype data was stored in Eigenstrat genotype files, which were generated per window using convertf (EIGENSOFT package; [140]).Custom scripts then converted the Eigenstrat genotype files into matrices for visualization using matrix2png [141].

Gene Enrichment and Variant Annotation
Coordinates and annotations of dog gene models were obtained from the Ensembl release 81 (ftp://ftp.ensembl.org/pub/release-81/and http://useast.ensembl.org/Canis_familiaris/Info/Index,respectively), and a nonredundant annotation set was determined.The sequence of each Ensembl protein was BLASTed against the NCBI non-redundant database (blastp -outfmt 5 -evalue 1e-3 -word_size 3 -show_gis -max_hsps_per_subject 20 -num_threads 5 -max_target_seqs 20) and all blastp outputs were processed through BLAST2GO [142] with the following parameters: minimum annotation cut-off of 55, GO weight equal to 5, BLASTp cut-off equal to 1e -6 , HSP-hit cut-off of 0, and a hit filter equal to 55. Positions of all predicted domestication loci (from this and previous studies) were intersected with the coordinates of the annotated Ensembl canine gene set to isolate genes within the putatively swept regions.Gene enrichment analyses were performed on these gene sets using topGO [43], and we utilized p-values produced by the parent-child test [44].The predicted effects of SNP variants were obtained by the processing of the total variant VCF file of all canine samples by Variant Effect Predictor (VEP; [42]).

Copy Number Estimation Using QuicK-mer and fastCN
We implemented two CN estimation pipelines to assess copy-number in village dogs and wolves using the depth of sequencing reads.The first, fastCN, is a modified version of existing pipelines that considers multi-mapping reads to calculate CN within 3kb windows (Additional File 9: Note 1).By considering multi-mapping reads, copy-number profiles will be shared among related gene paralogs, making it difficult to identify specific sequences that are potentially variable.The second pipeline we employed, QuicK-mer, a map-free approach based on k-mer counting which can accurately assess CN in a paralog-sensitive manner (Additional File 9: Note 2).Both pipelines analyze sequencing read-depth within predefined windows, apply GC-correction and other normalizations, and are able to convert read depth to a copy-number estimate for each window.The signal-to-noise ratio (SNR), defined as the mean depth in autosomal control windows divided by the standard deviation, was calculated for each sample.The CN states called by both the QuicK-mer and fastCN pipelines were validated through comparison with aCGH data from [143].Regions with copy number variation between samples in the aCGH or WGS data were selected for correlation analysis (Additional File 9: Note 3).

V ST Selection Scans
V ST values [69] were calculated for genomic windows with evidence of copy number variation.VST values were Z-transformed and we identified outlier regions as windows exhibiting at least a 1.5 copy number range across all samples, and ZV ST scores greater than 5 on the autosomes and the X-PAR, or greater than 3 in the X-nonPAR.Prior to analysis, estimated copy numbers for male samples on the non-PAR region of the X were doubled.Outlier regions spanning more than one window were then classified as variant candidate domestication regions (VCDRs) (Additional File 8: Table S7).A similar analysis was performed for the unplaced chromosomal contigs in the CanFam3.1 assembly (Additional File 12: Table S10).See Additional File 9: Note 3 for additional methods and details.

Amylase Structural Variant Analysis
We estimated copy-number using short-read sequencing data from each canine listed in Additional File 11: Table S9.Copy number estimates for the AMY2B gene using fastCN were based on a single window located at chrUn_AAEX03020568: 4873-8379.See

Supplementary Figures
Figure S1 -Z

Figure 1
Figure 1 Origin and diversity of sampled village dogs and wolves (A) The approximate geographic origin of the village dog (circles) and gray wolf (triangles) genome samples included in our analysis.Results of a Principal Component Analysis of the filtered village dog (N=43) and gray wolf set (N=11) are shown.Results are projected on (B) PC1 and PC2 and (C) PC3 and PC4.Colors in all figures correspond to sample origins and are explained in the PCA legends.

Figure 2
Figure 2 Comparison with previously published candidate domestication regions (A) Venn diagram of intersecting village dog, Axelsson et al. 2013 (AX), and Cagan and Blass 2016 (CB) CDRs.(B) Genotype matrix for 130 SNPs within chr7: 24,632,211-25,033,464 in AX_14 for 99 canine samples.Sites homozygous for the reference (0/0; blue) and alternate alleles (1/1; orange) are indicated along with heterozygous sites (0/1; white).Each column represents a single SNP (position indicated along top), while each row is a sample.Canid groupings are on the right of the matrix.

Figure 3
Figure 3 Circos plot of genome-wide selection statistics Statistics from multiple selection scans are provided across the autosomes (chromosomes identifiers are indicated in the inner circle).(A) Averaged XP-CLR scores in 25kb windows across the genome.Windows with significant scores (greater than 99th percentile from simulations) are in red, and those that passed filtration are in blue.Genes within significant windows are listed above each region.(B) F ST values calculated in 100kb windows.Values greater than the 99th percentile of simulations are in red.Windows that passed filtration are in green.(C) Averaged pooled heterozygosity (H P ) scores in 25kb windows are indicated.Significant values (less than the 0.1th simulation percentile) are in red.

Figure 4
Figure 4 Selection scan statistics at the RAI1 Locus Selection scan statistics surrounding the Retinoic Acid-Induced 1 (RAI1) locus (chr5: ~41.6-41.2Mb) (A) Per site F ST scores for all SNPs are indicated along with the F ST significance threshold determined by the 99 th percentile of simulations (red dashed line).
Figure S1 -Z transformed F ST scores for Cagan and Blass locus.

Figure S2 -Figure S6
Figure S2 -Demographic model for village dog and wolf populations used in neutral simulations Figure S3 -Filtration pipeline implemented for F ST and XP-CLR windows Figure S4 -Distribution of selection scan statistics for real and simulated F ST windows Figure S5 -Filtration pipeline implemented for Axelsson, Cagan and Blass, and Freedman CDRs Figure S6 Distribution of selection scan statistics for real and simulated XP-CLR windows Figure S7 -Gene intersect statistics from randomized permutations of XP-CLR gene positions Figure S8 -Gene intersect statistics from randomized permutations of V ST gene positions Figure S9 -Read-depth profiles at the AMY2B locus highlights large-scale structural variant Figure S10 -Correlations between the ddPCR and read-depth estimated copy-number for AMY2B and associated segmental duplications

Figure S11 -
Figure S11 -ddPCR results for the AMY2B gene, 1.9 Mb duplication, and the 2.0 Mb duplication linked structural .CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.It is made available under . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.It is made available under