Skip to main content

Population structure and pharmacogenomic risk stratification in the United States



Pharmacogenomic (PGx) variants mediate how individuals respond to medication, and response differences among racial/ethnic groups have been attributed to patterns of PGx diversity. We hypothesized that genetic ancestry (GA) would provide higher resolution for stratifying PGx risk, since it serves as a more reliable surrogate for genetic diversity than self-identified race/ethnicity (SIRE), which includes a substantial social component. We analyzed a cohort of 8628 individuals from the United States (US), for whom we had both SIRE information and whole genome genotypes, with a focus on the three largest SIRE groups in the US: White, Black (African-American), and Hispanic (Latino). Our approach to the question of PGx risk stratification entailed the integration of two distinct methodologies: population genetics and evidence-based medicine. This integrated approach allowed us to consider the clinical implications for the observed patterns of PGx variation found within and between population groups.


Whole genome genotypes were used to characterize individuals’ continental ancestry fractions—European, African, and Native American—and individuals were grouped according to their GA profiles. SIRE and GA groups were found to be highly concordant. Continental ancestry predicts individuals’ SIRE with > 96% accuracy, and accordingly, GA provides only a marginal increase in resolution for PGx risk stratification. In light of the concordance between SIRE and GA, taken together with the fact that information on SIRE is readily available to clinicians, we evaluated PGx variation between SIRE groups to explore the potential clinical utility of race and ethnicity. PGx variants are highly diverged compared to the genomic background; 82 variants show significant frequency differences among SIRE groups, and genome-wide patterns of PGx variation are almost entirely concordant with SIRE. The vast majority of PGx variation is found within rather than between groups, a well-established fact for almost all genetic variants, which is often taken to argue against the clinical utility of population stratification. Nevertheless, analysis of highly differentiated PGx variants illustrates how SIRE partitions PGx variation based on groups’ characteristic ancestry patterns. These cases underscore the extent to which SIRE carries clinically valuable information for stratifying PGx risk among populations, albeit with less utility for predicting individual-level PGx alleles (genotypes), supporting the concept of population pharmacogenomics.


Perhaps most interestingly, we show that individuals who identify as Black or Hispanic stand to gain far more from the consideration of race/ethnicity in treatment decisions than individuals from the majority White population.


Pharmacogenomic (PGx) variants are associated with inter-individual differences in drug exposure and response, affecting medication dosage, efficacy, and toxicity [1, 2]. A number of studies have shown racial and/or ethnic differences in drug response [3,4,5,6,7], based in part on group-specific differences in the frequencies of PGx variants [8]. A 2015 review found that 20% of drugs approved over the previous 6 years showed response differences among racial/ethnic groups, and these differences are often translated into group-specific prescription recommendations that are issued on FDA-approved drug labels [7]. Examples of such recommendations include contraindication of Rasburicase, a medication used to clear uric acid from the blood in patients undergoing chemotherapy, for individuals of African or Mediterranean ancestry, and a toxicity warning for the anticonvulsant Carbamazepine in Asian patients. A higher dosage of the immunosuppressive drug Tacrolimus is indicated for African-American transplant patients, whereas a lower initial dose of Rosuvastatin is recommended for Asians. Despite the inclusion group-specific recommendations in a number of drug labels, the utility of racial and ethnic categories in biomedical research, and their relevance to clinical decision making, remain a matter of substantial controversy [9,10,11,12].

Critiques of the use of racial and ethnic categories in biomedical research point to the appalling history of race science [13,14,15] and stress the potential of such research to reify outmoded notions of racial difference [16,17,18]. In addition, race is widely considered to be a social rather than a biological or genetic construct [19,20,21,22,23]. As it relates to clinically relevant PGx variation across groups, the extent to which racial and ethnic categories serve as a reliable proxy for genetic diversity has also been called into question. The authors of the recent commentary “Taking race out of human genetics” make a compelling case for eliminating the use of race as a category in genetic research, asserting that race and ethnicity are taxonomic (i.e., categorical) labels that by definition cannot capture the full complexity of individuals’ genetic ancestry (GA) [24]. They suggest that genetics research should instead focus on biogeographically defined populations and GA, as opposed to racial categories, and for this study, we hypothesized that GA should better partition PGx variation than race and ethnicity. We posit that GA provides a number of advantages over racial/ethnic categories for biomedical research: (i) it can be characterized independently of the social and environmental dimensions of race/ethnicity, (ii) it can be measured objectively and with precision, and (iii) it can be quantified as a continuous variable, as opposed to categorical racial/ethnic labels. Indeed, a number of recent studies have focused on PGx variation among populations defined by GA rather than racial and ethnic groups [25,26,27,28,29,30,31].

The goal of this study was to compare the relative utility of self-identified race/ethnicity (SIRE) versus GA for partitioning PGx variation among populations in the United States (US). We focused on individuals aged 50 and older, 75% of whom take prescription medication on a regular basis [32], and restricted our study to the three largest racial/ethnic groups in the US: White, Black (or African-American), and Hispanic/Latino [33]. Our study cohort is made up of 8628 participants from the Health and Retirement Study (HRS) [34], for whom we had both SIRE information and whole genome genotypes. We first compared the relationship between SIRE and GA, characterized via analysis of whole genome genotype data, and we then measured the extent to which PGx variation is partitioned by SIRE versus GA. SIRE and GA were found to be largely concordant, and GA provided only a marginal increase in PGx stratification compared to SIRE. Considering this finding together with the fact that patients’ SIRE is readily available to healthcare providers, we subsequently focused on the extent of PGx variation between SIRE groups. We provide a number of examples of PGx variants that are highly differentiated among SIRE groups and discuss the implications of these findings in light of population genetics and clinical decision-making.


Self-identified race/ethnicity (SIRE) and genetic ancestry (GA) in the US

We compared SIRE to GA for a cohort of 8628 individuals characterized as part of the Health and Retirement Study (HRS), for whom both SIRE information and whole genome genotypes were available (Table 1). HRS participants self-identified according to racial and ethnic labels defined by the US Government Office of Management and Budget (OMB). OMB defines five racial groups and two ethnic groups to assess disparities in health and environmental risks [35]. HRS participants were asked to select one or more race category and a single ethnic designation as Hispanic/Latino or not. We considered the race and ethnicity selections together and focused on the three largest categories in the HRS cohort: non-Hispanic White (5927; 68.7%), non-Hispanic Black (1527; 17.7%), and Hispanic/Latino of any race (1174; 13.6%). We refer to these three groups here as White, Black, and Hispanic. The percentages of each SIRE group in the HRS cohort resemble the demographics of the US: White = 72.4%, Black = 12.6%, and Hispanic = 16.3% [35].

Table 1 Demographic description for the cohort used in this study

Continental ancestry profiles were inferred for members of the HRS cohort by comparing their whole genome genotypes to whole genome sequence and genotype data for reference populations from Europe, Africa, and the Americas as described in the “Methods” section. Each HRS participant was assigned European, African, and Native American ancestry proportions, and the resulting ancestry profiles were then clustered into three distinct (non-overlapping) GA groups using k-means clustering. GA groups were defined without reference to SIRE group labels, using unsupervised clustering on continental ancestry fractions alone, and the choice to cluster ancestry profiles into three groups was made to allow for direct comparison with the three SIRE groups and in light of known patterns of continental ancestry in the US [36]. Permutation analysis was used to confirm the stability of the resulting GA groups and their robustness to changes in sample size (see Additional file 1: Figure S1). The distributions of continental ancestry fractions were compared for the three SIRE groups—White, Black, and Hispanic—and the three GA groups (Fig. 1).

Fig. 1

Race, ethnicity, and genetic ancestry in the US. Continental genetic ancestry patterns are shown for self-identified race/ethnicity (SIRE) and genetic ancestry (GA) groups: European ancestry (orange), African ancestry (blue), and Native American ancestry (red). HRS cohort participants are grouped by SIRE and GA, as described in the text, and continental ancestry fractions are compared for each grouping system. Top row: continental ancestry fractions for individuals organized into the three SIRE and three GA groups. Each column represents an individual genome, and the three continental ancestry fractions are shown for each individual column. Middle row: ternary plots showing the continental ancestry fractions for the SIRE and GA groups, as illustrated by the relative proximity to each of the three ancestry poles. Bottom row: average continental ancestry percentages for the SIRE and GA groups

The three objectively defined GA groups appear to correspond well to the SIRE groups, with respect to the distributions of individuals’ continental ancestry fractions (Fig. 1—top row). GA groups 1, 2, and 3 correspond to the White, Black, and Hispanic SIRE groups, respectively. The distributions of continental ancestry fractions for the SIRE and their corresponding GA groups are compared in Supplementary Figure 2 (see Additional file 1: Figure S2). Despite the apparent similarity between SIRE and GA, ternary plots underscore the broader distribution of ancestry fractions within SIRE groups compared to the non-overlapping GA groups delineated by k-means clustering (Fig. 1—middle row). This is especially true for the Hispanic group, consistent with the fact that it may include individuals who identify as any race. Overall, SIRE and the GA groups show similar average continental ancestry percentages: White/group 1 show ~ 99% European ancestry, Black/group 2 have ~ 82% African ancestry, and Hispanic/group 3 show predominantly European ancestry (~ 60%) with the highest levels of Native American ancestry (~ 37%) and the greatest variance in continental ancestry for any of the three groups. It should be noted that since our study cohort did not include individuals who identified as American Indian or Alaska Native, Native American ancestry here does not imply any tribal affiliation or community attachment.

The correspondence between the SIRE and GA groups was quantified by characterizing the overlap of membership assignments across the two groupings (see Additional file 1: Figure S3). Overall, individuals’ membership in the three SIRE and corresponding GA groups show 96.2% concordance. The highest concordance is seen for the White/group 1 pair, followed by Black/group 2, with Hispanic/group 3 showing the lowest concordance. The levels of concordance vary according to which grouping system is taken as the reference for comparison. This distinction is most obvious for the Hispanic/group 3 pairing: 96.6% of group 3 members self-identify as Hispanic, while only 77.1% of self-identified Hispanics fall into group 3.

Pharmacogenomic variation in the US

PGx variants that influence drug response were mined from the PharmGKB database, and levels of PGx variation were compared within and between the SIRE and GA groups defined for the HRS cohort. Results for SIRE group comparisons are shown in Fig. 2, and results for the analogous comparison of GA groups are shown in Supplementary Fig. 4 (see Additional file 1: Fig. S4). PGx variants show higher allele frequencies, higher allele frequency differences between groups, and higher levels of heterozygosity compared to non-PGx variants genome-wide (Fig. 2a–c). We considered group-specific differences in PGx variation in terms of the fixation index (FST), a commonly employed measure of population differentiation, and effect allele frequency differences. PGx FST and effect allele frequency difference values are highly correlated, as can be expected, and the largest differences are seen for the Black-White and Black-Hispanic group comparisons (Fig. 2d–f). Notably, even the most extreme values of FST fall well below 0.5, indicating the most PGx variation is found within rather than between SIRE groups. Nevertheless, there are 82 PGx variants that show statistically significant (FDR q < 0.05) values of allele frequency differentiation between any individual SIRE group and the other two groups, i.e., their complements (Fig. 2g and Supplementary Table 2). The significantly diverged PGx variants show an average FST value of 0.15 compared to 0.05 for the remaining variants (see Additional file 1: Figure S5). All-against-all pairwise distances for HRS participants were calculated using PGx variants and projected into two-dimensions with multi-dimensional scaling (MDS). K-means clustering was used to create three groups based on the PGx MDS distances, and individuals were labeled according to their SIRE (Fig. 2h). Genome-wide patterns of PGx variation characterized in this way show 96.1% correspondence to SIRE group labels (Fig. 2i).

Fig. 2

Pharmacogenomic variation in the US. Genome-wide average allele frequencies (a), group-specific allele frequency differences (b), and heterozygosity fractions (c) are shown for PGx variants (red) compared to non-PGx variants (blue). df Fixation index (FST; y-axis) and allele frequency differences (x-axis) for pairs of SIRE groups. Statistically significant PGx allele frequency differences are highlighted in black. g Heatmap showing group-specific allele frequencies for significantly diverged PGx variants. h Multi-dimensional scaling (MDS) plot showing the relationship among individual genomes as measured by PGx variants alone. Each dot is an individual HRS participant genome, and genomes are color-coded by participants SIRE. i The correspondence between SIRE groups and PGx groups defined by K-means clustering on the results of the MDS analysis. Data shown here correspond to SIRE groups; analogous results for GA groups are shown in Supplementary Figure 4 (see Additional file 1: Figure S4)

SIRE versus GA for partitioning pharmacogenomic variation

Given the overall correspondence, and group-specific differences, seen for SIRE and GA, we wanted to compare the utility of SIRE versus GA for partitioning pharmacogenomic variation in the US. Here, we asked two questions regarding PGx variation between groups: (1) are PGx allele frequencies correlated between SIRE and GA groups and (2) do GA groups partition PGx variation more so than SIRE groups? The first question was addressed by regressing PGx frequency differences between grouping systems (SIRE vs. GA groups), and the second question was addressed by considering the deviation of the regression from the unity line (i.e., the expected value under perfect correlation). As expected, given the observed similarities between SIRE and GA groups, PGx allele frequency differences are highly correlated when corresponding group pairs are compared (Fig. 3). The highest correlation is seen when the Black and White SIRE groups are compared to their corresponding GA groups. Comparisons that include the Hispanic SIRE group show lower levels of correlation.

Fig. 3

Self-identified race/ethnicity (SIRE) versus genetic ancestry (GA) for partitioning pharmacogenomic (PGx) variation. ac Regression of pairwise PGx variant effect allele frequency differences calculated using SIRE (y-axis) versus the corresponding GA groups (x-axis). Results of two statistical tests are shown for each of three pairwise group regressions. Test 1 evaluates whether SIRE and GA PGx allele frequencies are correlated, and test 2 evaluates that amount of additional resolution on PGx variant divergence that is provided by GA compared to SIRE. Details on each test are provided in the text

With respect to the second question regarding the partitioning of PGx variation, allele frequency differences between the Black/White SIRE groups and their corresponding GA groups fall almost entirely along the unity line; in this case, genetic ancestry does not provide any additional information regarding PGx variation (Fig. 3a). For both comparisons that include the Hispanic group however, the slope of the regression is less than one, indicating greater PGx allele frequency differences between GA groups compared to their corresponding SIRE groups (Fig. 3b, c). Thus, GA does provide more information than SIRE when ethnicity is considered, but the effect size of this difference is small (d = 2.5% for Black/group 2 vs. Hispanic/group 3 and d = 6.5% for Hispanic/group 3 vs. White/group 1).

Thus far, we have shown that SIRE and GA groups are highly concordant for the HRS cohort and that PGx allele frequency differences are similar for both classification systems. Since SIRE labels are routinely collected as patient provided information and are also readily available as part of electronic health records, we focused on PGx variation between SIRE groups to explore the potential clinical utility of race and ethnicity. We wanted to know whether PGx effect allele frequency differences of the magnitude observed here have any utility for guiding medication prescription decisions in light of the fact that the majority of PGx variation is found within rather than between SIRE groups. We considered the odds ratios for the apportionment of PGx risk alleles among individual SIRE groups and their complements as an indicator of SIRE groups’ predictive utility, given that odds ratios are widely used to associate categorical risk factors with health outcomes [37]. We also computed absolute risk increase values to account for the population frequency of PGx risk alleles when considering the magnitude of between group differences as well as the accuracy with which SIRE group membership predicts PGx alleles or genotypes. Detailed results for all PGx variants analyzed here can be found in Supplementary Table 3.

Examples of highly differentiated PGx variants are shown in Table 2 and Fig. 4. These examples were chosen as variants that had relatively high odds ratio values across different PGx effect types (dosage, efficacy, metabolism, and toxicity), highlighting instances for each of the three SIRE groups. The relative percentages of PGx effect (above) and non-effect (below) alleles across SIRE groups reveal the extent of differentiation for these variants (Fig. 4a), and the observed allele frequency differences are associated with SIRE group-specific continental ancestry fractions (Fig. 4b–d). Nevertheless, as described above and shown in Fig. 2, even highly differentiated PGx variants show levels of FST that indicate substantially more within than between group variation (see pie charts in Fig. 4b–d). Despite the relatively high levels of within group PGx variation, these variants show high group-specific odds ratios and substantial absolute risk increase values. In other words, HRS cohort members’ racial and ethnic self-identities carry substantial information that can be used to stratify pharmacogenomic risk at the population level. However, the accuracy levels with which group affiliations predict specific risk alleles or genotypes are only marginally high, indicating that SIRE has relatively less utility for individual-level risk prediction compared to risk stratification.

Table 2 Examples of highly differentiated PGx variants. This table lists some examples of highly diverged PGx variants in the three SIRE groups under consideration. In the table, “Ref. Pop.” refers to reference population, OR refers to odds ratios, and ARI refers to the absolute risk increase percentage. Values in brackets specify the 95% confidence intervals for each computation
Fig. 4

Examples of highly differentiated pharmacogenomic (PGx) variants. a SIRE group percentages of effect (above axis) versus non-effect (below axis) alleles/genotypes are shown for six highly differentiated PGx variants. Allele counts are used for the additive PGx effect mode, and genotype counts are used for the dominant effect mode. b, c The extent of within versus between group variation, ancestry associations, and PGx stratification/risk by SIRE groups are shown for three examples. Ancestry associations relate the ancestry fractions for individuals that bear distinct PGx genotypes: European (orange), African (blue), and Native American (red). Effect (blue) versus non-effect (gray) allele/genotype counts are compared for the group enriched for a specific PGx variant compared to the other two groups. Allele counts are shown for the additive PGx effect mode, and genotype counts are shown for the dominant mode. Group-specific allele/genotype counts were used to compute odds ratios and absolute risk increase values (risk stratification) along with group-specific prediction accuracy values (risk prediction) as shown

For example, the A allele of the PGx variant (rs1045642) in the ATP Binding Cassette Subfamily B Member 1 (ABCB1) gene is associated with a decreased fentanyl opioid dose requirement [38] (Fig. 4b). This PGx variant has a dominant mode of effect, such that patients with either the AA or GA genotype tend to metabolize fentanyl slower than patients with the GG genotype and will therefore require a lower dosage. 96.0% of variation for this PGx variant is partitioned within SIRE groups compared to 4.0% variation between groups. However, the dosage-associated genotypes are far more common in individuals who identify as White (OR = 3.3, CI = 3.0–3.6; ARI = 26.1%, CI = 24.0–28.3%), and from the ancestry association plot, it can be seen that the effect allele (A) is highly correlated with European genetic ancestry (β = 0.20, P = 1.95e−35). Self-identification as White predicts dosage-associated genotypes with 68.5% accuracy.

Similarly, a PGx variant (rs2500535) in the Uronyl 2-Sulphotransferase (UST) gene has been found to be associated with the efficacy of nortriptyline—an antidepressant—in patients with major depressive disorder [39] (Fig. 4c). This PGx variant has a dominant mode of effect; patients with the A allele are associated with a decreased improvement of depression symptoms when prescribed nortriptyline. These lower efficacy genotypes are more common in individuals who identify as Hispanic. Even though the variation at this genomic site is far higher within (93.5%) compared to between (6.5%) groups, the odds ratio for having risk-associated genotypes is high for the Hispanic population (OR = 6.07, CI = 5.44–6.82) along with a high absolute risk increase (ARI = 20.3%, CI = 18.5–22.2%). Hispanic ethnicity predicts nortriptyline efficacy-associated genotypes with 85.2% accuracy.

Another PGx variant (rs6977820) found in the Dipeptidyl Peptidase Like 6 (DPP) gene has been associated with adverse response to antipsychotic drugs (Fig. 4d). This PGx variant has an additive effect mode, whereby the T allele is positively correlated with African ancestry and associated with tardive dyskinesia among schizophrenia patients treated with antipsychotics [40]. When individuals that self-identify as Black are compared to the other two SIRE groups, most variation at this variant is found within (85.9%) rather than between (14.1%) groups. However, the odds ratio for the presence of the risk allele for adverse reaction to antipsychotics is high (OR = 7.7, 95% CI = 7.1–8.49), as is the absolute risk increase (ARI = 47.2%, 95% CI = 45.4%–48.9%), consistent with a substantially elevated risk of adverse drug reaction for the Black SIRE group compared to the others. Individuals who self-identify as Black can be predicted to have the effect-associated allele with 73.0% accuracy.

Clinical value of pharmacogenomic stratification by SIRE

We quantified the clinical utility of SIRE for partitioning PGx variation by comparing the ability to predict PGx effect alleles/genotypes before (pre) and after (post) stratification of the population by SIRE. The approach we used is equivalent to the comparison of pre- and post-test probabilities for diagnostic tests, where the test in this case is patient stratification by SIRE. For any given PGx variant, the pre-test probability is the overall population prevalence of the PGx effect allele/genotype, and the post-test probabilities are the group-specific positive predictive values (PPVs) for the PGx effect allele or genotype. Allele counts were used to compute these probabilities for PGx variants that show an additive effect mode, and genotype counts were used for the dominant effect mode. The absolute difference of the pre- and post-test probabilities calculated in this way was taken as a measure of the amount of information that is gained, with respect to PGx variant prediction for each specific group, when SIRE is used for patient stratification.

When highly differentiated PGx variants (Figs. 2g and 4) are analyzed in this way, the SIRE groups that show the highest effect allele frequencies for any given variant provide substantial additional information for PGx prediction. Considering the PGx variant (rs2500535) that is associated with nortriptyline efficacy (Fig. 4c), stratification by Hispanic identity yields an additional 14 individuals, for every 100 patients to be treated, who are predicted to show decreased improvement of symptoms related to depressive disorder. The information gain is even more extreme for the PGx variant (rs6977820) that is associated with antipsycohtic toxicity (Fig. 4d). For this variant, stratification of individuals that self-identify as Black will yield an additional 39 out of every 100 patients that are counter-indicated for the antipsychotic medications owing to toxic side effects. The overall levels of information gained via stratification by SIRE differ widely by group. Individuals that self-identify as Black show the highest levels of information gain for PGx variant prediction followed the Hispanic and White groups, respectively (Fig. 5). This pattern can be attributed to the relative numbers of individuals in each SIRE group together with the extent of genetic diversification seen between groups. The relatively high frequency of PGx effect alleles (Fig. 2a) also contributes to the amount of information gain observed here, given the fact that PPVs depend on the prevalence of the condition that is being tested (i.e., the presence of PGx effect alleles/genotypes).

Fig. 5

Information gained when SIRE is used for PGx stratification. The amount of information gained per 100 individuals is the number additional correct PGx variant predictions made when SIRE is used to stratify the population. Information gain is calculated for all PGx variants in each SIRE group, as described in the text, and the group-specific distributions are shown as density distributions and box-plots (inset): White (orange), Black (blue), and Hispanic (red)


Concordance between SIRE and GA in the US

The SIRE and GA groups from the US analyzed here show > 96% overall concordance (Fig. 1, also see Additional file 1: Figures S2 and S3). It must be stressed that these results only apply to the three major racial/ethnic groups covered by the ~ 8600 individual HRS cohort; nevertheless, the concordance between SIRE and GA seen for the HRS cohort is very much consistent with a number of previous studies of the US population. In 2005, investigators showed a 99.9% concordance between SIRE and genetically derived clusters for 3636 individuals from four racial/ethnic groups [41], and a 2007 study reported 100% classification accuracy of individuals from geographically separated population groups when thousands of genetic variants were used for clustering [42]. More recently, a study of > 11,000 cancer patients from The Cancer Genome Atlas found an 95.6% concordance between self-reported race (not ethnicity) and GA [43], and a study of > 200,000 individuals from the Million Veterans Program found > 99.4% concordance between SIRE and GA [44]. The latter two studies relied on machine learning classifiers powered by vectors of 7 and 30 ancestry principal components, respectively, whereas our clustering algorithm uses vectors of only three continental ancestry components to classify individual genomes. Additionally, the distribution of GA fractions observed here for the HRS cohort SIRE groups is consistent with previous studies [36, 45,46,47,48]. Taken together, our results and others underscore the extent to which continental ancestry patterns can distinguish SIRE groups in the US.

Genetic differences accumulate among populations when they are reproductively isolated, and isolation by distance [49] best accounts for the apportionment of human genetic diversity among global populations [50]. Populations that are physically distant, or separated by major geographic barriers, are more genetically diverged than nearby populations [51]. It follows that the appearance of population structure, i.e., distinct clusters of genetically related individuals, can represent an artifact of uneven sampling of human populations at extremes of distance [52]. For instance, isolation by distance can explain much of the apparent genetic structure observed for major genome sequencing projects such as the 1000 Genomes Project [53, 54] and the Human Genome Diversity Project [55, 56]. Conversely, when human populations are sampled more evenly across a range of distances, and in the absence of major geographical barriers, genetic diversity appears to be continuously distributed as a cline of variation [57, 58].

Isolation by distance can be taken to explain the concordance of the SIRE and GA groups observed for the HRS cohort, since the three major US SIRE groups are made up of individuals with ancestry from continental population groups—European, African, and Native American—that were isolated at great distances for tens-of-thousands of years before coming back together over the last 500 years [36, 45]. Since each SIRE group contains distinct patterns of continental ancestry, they correspond well to objectively defined clusters formed based on the partitioning of GA (Fig. 1, also see Additional file 1: Figure S2 and S3). In addition, despite the fact that these population groups are currently co-located within the US, assortative mating stands as an ongoing reproductive barrier among groups [59, 60] (but see below for an important caveat regarding this fact). It is nevertheless important to note that most of the SIRE and GA groups analyzed here are not composed of individuals with highly coherent ancestry patterns. Only the White/cluster 1 groups show coherent ancestry patterns, whereas the Black/cluster 2 and Hispanic/cluster 3 groups are made up of individuals that vary along a range of continental ancestry fractions (Fig. 1 and see Additional file 1: Figure S2). This is especially true for the Hispanic group, consistent with the fact Hispanic is an intentionally broad label that covers individuals from different races and with very distinct ancestry patterns [61].

An important caveat with respect to the high concordance between SIRE and GA observed here relates to the age of the individuals in the HRS cohort (Table 1). We chose to focus on older Americans given their disproportionate use of prescription medications [32], and HRS recruited participants aged 50 and over starting in 1992. The average age of the HRS cohort analyzed here is 57.5 years (CI 57.0–58.0), and all of the study participants were born before 1965, when there were still “anti-miscegenation” laws in nineteen states [62]. Rates of intermarriage among SIRE groups have increased substantially since that era [63], and as admixture continues to increase over time, the ancestral coherence of SIRE groups is expected to fall precipitously. Increased rates of immigration, coupled with the arrival of more globally diverse immigrant groups, will also blur boundaries between SIRE groups, potentially rendering the current labels clinically uninformative. Indeed, the most widely used SIRE labels in the US are mandated by the OMB, and they will likely be revised in the near future to better capture the increasing diversity of the US population. As such, the clinical relevance of SIRE will almost certainly decrease over time.

Within versus between group genetic divergence

It has long been appreciated that the vast majority of human genetic variation is found within rather than between populations. This fundamental result was first reported for worldwide racial groups, based on analysis of a handful of (surrogate) genetic markers [64], and has since been confirmed by numerous studies of populations defined by GA using larger-scale analyses [55, 65,66,67,68,69]. The distinction between this fundamental result and the high concordance seen for SIRE and GA, as well as the ability to cluster human population groups at various levels of relatedness, can be explained by the difference between univariate methods for variance partitioning versus multivariate classification methods [70, 71]. The analysis of PGx variation reported here is univariate, since we focus on the apportionment of variation for individual PGx variants, and we confirm that the majority of PGx variation is found within the HRS cohort groups (Figs. 2 and 4).

We used a standard evidence-based medicine analytical framework [37, 72] in an effort to understand the clinical relevance of PGx variation that is partitioned among SIRE groups in this way. In particular, we asked how the observed PGx differences between groups could be clinically relevant when the majority of variation falls within population groups, even for the most divergent variants found here. Despite the observed pattern of within versus between group PGx variation, we found numerous cases of high odds ratios and high absolute risk increases for the group-specific prevalence of PGx variants (Table 2 and Fig. 4). In other words, membership in any given SIRE group can entail substantially greater odds, and far higher risk, of carrying clinically relevant PGx variants compared to members of other groups. Information of this kind should be an important consideration for clinicians charged with making treatment decisions and could also be of value for well-informed patients.

Finally, it should be emphasized that humans are far more similar than they are different at the genomic level, both within and between population groups. As of August 2019, there were 674 million annotated single nucleotide variants among the ~ 3 billion sites in the human genome [73]. Thus, more than 75% of genomic positions are conserved among all human population groups, and for those positions that do vary, the majority are rare variants that segregate at < 1% frequency worldwide [54]. Nevertheless, the results reported here underscore the potential clinical relevance for those genetic variants that do show relatively high levels of between-group divergence.

Caveats and limitations

It is important to note that in this study, we measure the frequency of PGx variants across different SIRE and GA groups, rather than drug response differences per se. Even though the penetrance of PGx variants is generally high [2], clinical interpretations of variant frequency differences should be considered in light of variable penetrance levels as well. In cases of low penetrance, the magnitude of drug response differences between groups will be dampened. Furthermore, if PGx variants have different magnitudes of effect for different groups, i.e., group-specific effect sizes, then differences in drug response cannot be directly inferred from PGx variant frequency differences alone. However, since the majority of PGx variants are causative protein-coding variants [2], the likelihood of group-specific effect sizes is far lower than would be expected for non-coding variants discovered by genome-wide association studies, which are typically tag markers that are linked to nearby causative variants. Finally, the focus on single nucleotide variants (SNVs) is another limitation of the study, given the fact structural variants and multi-variant haplotypes have also been associated with inter-individual drug response differences. Nevertheless, the vast majority of PGx variants annotated in the PharmGKB database are SNVs [2], suggesting that our analytical approach captures most of the known variant-drug associations.


The current and future utility of race and ethnicity in pharmacogenomics

As previously noted, demographic trends in the US suggest that the clinical relevance of SIRE, including its predictive utility for PGx variation, is expected to continuously decrease over time. The increasing adoption of routine genetic testing for precision medicine could also render SIRE obsolete for stratifying PGx variation [74]. This is because genotyping of specific PGx variants will obviously provide far more accurate risk prediction than SIRE. For example, even a highly divergent PGx variant, like the antipsychotic toxicity associated variant rs6977820 (Fig. 4d), will yield a mis-prediction of the PGx risk allele 27% of the time if SIRE alone were used as a predictor. In this sense, the high group-specific PGx odds ratios and absolute risk increases observed in this study are best considered as surrogate guides to inform the optimal choice of prescribed medication, rather than precise diagnostic tools. In other words, SIRE categories provide valuable information for stratifying PGx risk at the population level but not for predicting individual-level PGx variants. Having said that, and despite the promise of population-scale genomic screening initiatives and biobanks [75], such as the NIH All of Us project [76], the day when all Americans will have ready access to their genetic profiles remains far in the future. Unfortunately, this is likely to be even more so for minority communities that are vastly underrepresented among clinical genetic cohorts [77, 78]. Until that time, SIRE will remain an important feature for clinicians to consider when making treatment decisions.

Perhaps most importantly, the current utility of SIRE is most apparent for groups who are underrepresented in biomedical research. Individuals who self-identify as Black or Hispanic stand to gain far more information with respect to precision treatment decisions than those who identify as White (Fig. 5). This finding can be attributed to the relative frequencies of individuals in each of the three SIRE groups analyzed here, which closely mirror the current demography of the US, and the extent of genetic divergence among groups. If a “one size fits all” approach to drug prescription is used, patients who identify as White are more likely to receive the most appropriate treatment, since their PGx variant frequencies will be closest to the overall population mean. Conversely, individuals who identify as Black or Hispanic have the most to lose if SIRE is not considered when making treatment decisions.


Study cohort

Self-identified race and ethnicity (SIRE) information and whole genome genotypes for Americans over the age of 50 and their spouses were collected as part of a nationally representative longitudinal panel study called the Health and Retirement Study (HRS) [34]. For the current study, only HRS participants with both SIRE and genotype information were considered (8912 participants). The 284 participants who did not identify with one of the three largest racial/ethnic categories in the HRS data—non-Hispanic White (5927), non-Hispanic Black (1527), and Hispanic/Latino of any race (1174)—were excluded from this analysis. This yielded a total of 8628 individuals in our final analysis cohort.

Genetic ancestry (GA) analysis

HRS participants were previously genotyped at ~ 2,381,000 genomic sites using the Illumina Omni2.5 BeadChip [34]. Whole genome genotype data from HRS participants were compared to reference populations from Europe, Africa, and the Americas in order to infer their continental genetic ancestry patterns as previously described [45] (see Additional file 1: Table S1) [54, 56, 79]. Reference populations were taken from (i) the 1000 Genomes Project (648) [54], (ii) the Human Genome Diversity Project (110) [56], and (iii) 21 Native American populations from across the Americas (90) [79]. A custom script that employs PLINK version 1.9 [80] was used to harmonize the HRS and reference population variant calls. The variant call data were merged by identifying the set of variants common to both datasets, with strand flips and variant identifier inconsistencies corrected as needed. The initial merged and cleaned variant data set was filtered for variants with > 1% missingness and < 1% minor allele frequency among samples. The final harmonized genotype data contains 228,190 genomic sites. The harmonized genotype dataset was phased using ShapeIT version 2.r837 [81]. ShapeIT was run without reference haplotypes, and all individuals were phased at the same time. Individual chromosomes were phased separately, and the X chromosome was phased with the additional “-X” flag.

A modified version of the RFMix program [45, 82] was used to characterize the continental genetic ancestry patterns for the HRS participants, with European, African, and Native American populations used as reference populations. RFMix was run in the “PopPhased” mode with a minimum node size of five, using 12 generations and the “—use-reference-panels-in-EM” for two rounds of EM, to assign continental ancestry for haplotypes genome-wide. Contiguous regions of ancestral assignment, “ancestry tracts,” were created where RFMix ancestral certainty was at least 95%, and genome-wide continental ancestry estimates for HRS participants were obtained by averaging across confidently assigned ancestry tracts.

Non-overlapping genetic ancestry (GA) groups were defined from individual participants’ continental ancestry estimates obtained via RFMix analysis using k-means clustering implemented in the Python package Scikit-learn [83] with k = 3. Each participant was represented as a point in three-dimensional (3-D) space, parameterized by their three continental ancestry fractions. Formally, the position of a participant (i) in this genetic ancestry space was defined by (Ei, Ai, Ni), where Ei, Ai, and Ni are the European, African, and Native American ancestry fractions. K-means clustering using Euclidean distances between all pairs of individual participants in this 3-D genetic ancestry space to yield three non-overlapping clusters. Given that k-means clustering can be unstable, the algorithm was run on these data 100 times and the most probable group membership was assigned to each participant. This method allowed us to define three non-overlapping groups of HRS participants informed entirely by their genetic ancestry and free from the social dimensions of SIRE.

The association between GA and PGx variant genotypes was measured using our previously described method [25]. To obtain the strength of association (β) between continental ancestry proportions and genotypes, continental ancestry fractions were regressed against the observed PGx variant genotypes. Formally, the genetic ancestry fraction y = βx + ε, where x {0, 1, 2} refers to the number of PGx variant effect alleles. The significance of these ancestry associations was quantified using a t test.

Measurement of PGx variation

Single nucleotide variants (SNVs) associated with pharmacogenomic response—i.e., PGx variants—were mined from the Pharmacogenomic Knowledgebase (PharmGKB) [2]. This online database is a source of manually curated clinical variant annotations for PGx variants and their associated drug-response phenotypes. Data on the chromosomal locations of PGx variants, the identity of PGx effect (risk) alleles, PGx variants’ mode of effect (additive or dominant), clinical annotations, and clinical evidence levels were parsed and taken for analysis. A total of 2351 PGx variants were accessed from PharmGKB, 989 of which were genotyped for the HRS cohort. Only directly genotyped PGx variants were used for analysis. PharmGKB annotates the specific effect alleles that are associated with inter-individual differences in drug dosage, efficacy, metabolism, and toxicity. The direction of effect (higher or lower) is specific to individual PGx variants for dosage, efficacy, and metabolism whereas toxicity effect alleles always correspond to increased toxicity.

PGx allele frequencies for SIRE and GA groups were computed as the group-specific counts of effect alleles normalized by the total number of typed individuals for each group. Pairwise between group fixation index (FST) values for each variant were computed by calculating two components: (i) the mean expected heterozygosity within subpopulations, \( {\overline{H}}_S=\frac{1}{2}\ \sum \limits_i2\left({p}_i\right)\left(1-{p}_i\right)\left(\frac{{\mathrm{count}}_i}{\mathrm{total}\ \mathrm{count}}\right) \), where pi is the frequency of risk allele in population i, counti is the number of individuals in population i, and total count refers to the total number of individuals in both populations, and (ii) the expected heterozygosity in the total population, \( {H}_T=2\left(\overline{p}\right)\left(1-\overline{p}\right) \), where \( \overline{p} \) is the mean effect allele frequency in both populations under consideration. The fixation index was computed by combining the two computed metrics as \( {F}_{ST}=1-\frac{{\overline{H}}_S}{H_T} \) [84]. PGx variants were used to calculate pairwise inter-individual distances for all HRS participants using PLINK, and the resulting distance matrix was projected into two dimensions using multi-dimensional scaling (MDS) with the mds function in R. K-means clustering of the participants in MDS space was used to generate three non-overlapping PGx variant groups in the same way as described for the GA groups.

Odds ratios (ORs) were calculated for group-specific PGx effect allele counts [37]. In a contingency table for the counts of effect allele in population PA with the four values: PE (effect allele count in PA), PN (non-effect allele count in PA), QE (effect allele count in non-PA individuals), and QN (non-effect allele count in non-PA individuals), this was done using the formula \( OR=\frac{P_E/{Q}_E}{Q_N/{Q}_N} \), with confidence intervals calculated as CI = exp(log(OR) ± Zα/2 × SElog(OR)), where α is 0.05, Zα/2 is 1.6, and \( \mathrm{S}{\mathrm{E}}_{\log \left(\mathrm{OR}\right)}=\sqrt{\frac{1}{P_E}+\frac{1}{P_N}+\frac{1}{Q_E}+\frac{1}{Q_N}} \). Similarly, using group-specific PGx effect counts the absolute risk increase (ARI) was calculated as \( \mathrm{ARI}=\frac{P_E}{P_E+{P}_A}-\frac{Q_E}{Q_E+{Q}_A} \), with confidence intervals calculated as CI = ARI ± Zα/2 × SEARI, where α is 0.05, Zα/2 is 1.96, and \( \mathrm{S}{\mathrm{E}}_{\mathrm{ARI}}=\sqrt{P_E{P}_A+{Q}_E{Q}_A\ } \)[72]. Group-specific genotype prediction accuracy values were calculated as Accuracy = (TP + TN)/(TP + TN + FP + FN), where TP is true positives, TN is true negatives, FP is false positives, and FN is false negatives. TP, TN, FP, and FN designations are assigned based on the SIRE group that shows enrichment for PGx effect allele (or genotype). The presence of the PGx effect allele in the implicated SIRE group is counted as a true positive, whereas its presence in the other groups is counted as a false positive. Conversely, the presence of the PGx non-effect allele in the implicated SIRE group is counted as a false negative, whereas its presence in the other groups is counted as a true negative. Accuracy confidence intervals are calculated as \( \mathrm{CI}=\mathrm{Accuracy}\pm {Z}_{\alpha /2}\times \sqrt[2]{\frac{\mathrm{Erro}{\mathrm{r}}_{\mathrm{prediction}}}{1-\mathrm{Erro}{\mathrm{r}}_{\mathrm{prediction}}}/N} \), where \( \mathrm{Erro}{\mathrm{r}}_{\mathrm{prediction}}=\frac{FP+ FN}{TP+ TN+ FP+ FN} \) and N = TP + TN + FP + FN. As noted before, when α is 0.05, Zα/2 is 1.96.

Pre- and post-test probabilities were compared in order to compute the amount of information gained per 100 individuals based on PGx stratification with SIRE. For any given PGx variant, the pre-test probability is calculated as the overall population prevalence of the PGx effect allele (additive mode) or genotype (dominant mode): Prevalenceoverall = CountEA/CountTotal, where CountEA is the count of the effect allele/genotype in the cohort and CountTotal is the total count of alleles/genotypes at that locus in the cohort. The post-test probability is calculated as the group-specific positive predictive values (PPVs) for the PGx effect allele or genotype. PPV is calculated as follows: \( \mathrm{PP}{\mathrm{V}}_A={\mathrm{Count}}_{EA}^A/{\mathrm{Count}}_{\mathrm{Total}}^A \), where \( {\mathrm{Count}}_{EA}^A \) is the count of the effect allele/genotype in population A and \( {\mathrm{Count}}_{\mathrm{Total}}^A \) is the total count of alleles/genotypes at that locus in the population A. Information gain is then calculated as: InfoGainA =   PPVA − Prevalenceoverall.

Comparison of SIRE and GA

To test whether PGx variant allele frequencies were correlated between SIRE and GA, pairwise PGx variant allele frequency differences calculated for SIRE groups were regressed against allele frequency differences calculated for GA groups. Here, the null hypothesis is H0 : β = 0, while the alternate hypothesis is HA : β ≠ 0. The significance of this correlation was testing using a t test where t = (βobs − βexp)/SE and P = P(TDF ≤ βexp). Next, we tested whether GA groups partition PGx variation more than SIRE groups using the same regression. For this test, the null hypothesis is H0 : β = 1, while the alternate hypothesis is HA : β < 0. An underlying assumption for this one-tailed test is that GA groups should hold more information about PGx allele frequency differences when compared to SIRE groups. We calculated the difference in the expected (unity line) and observed (SIRE versus GA) regression slopes, d = (βexp − βobs)/2 to quantify the magnitude of the effect. A denominator of 2 was chosen to reflect the entire range of possible slopes that the data may take—going from −1, where SIRE groups reflect exactly the opposite difference in allele frequencies, to 1, where SIRE groups faithfully and completely capture the allele frequency differences observed in GA groups. The statistical significance was tested using a t test as described above.

Availability of data and materials

The data that support the findings of this study are available from Health and Retirement Study but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. Data are however available from the authors upon reasonable request and with permission of Health and Retirement Study.


  1. 1.

    Evans WE, Relling MV. Pharmacogenomics: translating functional genomics into rational therapeutics. Science. 1999;286(5439):487–91.

    CAS  PubMed  Article  PubMed Central  Google Scholar 

  2. 2.

    Barbarino JM, Whirl-Carrillo M, Altman RB, Klein TE. PharmGKB: a worldwide resource for pharmacogenomic information. Wiley Interdiscip Rev Syst Biol Med. 2018;10(4):e1417.

    PubMed  PubMed Central  Article  Google Scholar 

  3. 3.

    Yasuda SU, Zhang L, Huang SM. The role of ethnicity in variability in response to drugs: focus on clinical pharmacology studies. Clin Pharmacol Ther. 2008;84(3):417–23.

    CAS  PubMed  Article  PubMed Central  Google Scholar 

  4. 4.

    Huang SM, Temple R. Is this the drug or dose for you? Impact and consideration of ethnic factors in global drug development, regulatory review, and clinical practice. Clin Pharmacol Ther. 2008;84(3):287–94.

    PubMed  Article  PubMed Central  Google Scholar 

  5. 5.

    Chen ML. Ethnic or racial differences revisited: impact of dosage regimen and dosage form on pharmacokinetics and pharmacodynamics. Clin Pharmacokinet. 2006;45(10):957–64.

    CAS  PubMed  Article  PubMed Central  Google Scholar 

  6. 6.

    Bjornsson TD, Wagner JA, Donahue SR, Harper D, Karim A, Khouri MS, et al. A review and assessment of potential sources of ethnic differences in drug responsiveness. J Clin Pharmacol. 2003;43(9):943–67.

    CAS  PubMed  Article  PubMed Central  Google Scholar 

  7. 7.

    Ramamoorthy A, Pacanowski MA, Bull J, Zhang L. Racial/ethnic differences in drug disposition and response: review of recently approved drugs. Clin Pharmacol Ther. 2015;97(3):263–73.

    CAS  PubMed  Article  PubMed Central  Google Scholar 

  8. 8.

    Bachtiar M, Lee CG. Genetics of population differences in drug response. Curr Genet Med Rep. 2013;1(3):162–70.

    Article  Google Scholar 

  9. 9.

    Risch N, Burchard E, Ziv E, Tang H. Categorization of humans in biomedical research: genes, race and disease. Genome Biol. 2002;3(7):comment2007.

  10. 10.

    Cooper RS, Kaufman JS, Ward R. Race and genomics. N Engl J Med. 2003;348(12):1166–70.

    PubMed  Article  PubMed Central  Google Scholar 

  11. 11.

    Caulfield T, Fullerton SM, Ali-Khan SE, Arbour L, Burchard EG, Cooper RS, et al. Race and ancestry in biomedical research: exploring the challenges. Genome Med. 2009;1(1):8.

    PubMed  PubMed Central  Article  Google Scholar 

  12. 12.

    Burchard EG, Ziv E, Coyle N, Gomez SL, Tang H, Karter AJ, et al. The importance of race and ethnic background in biomedical research and clinical practice. N Engl J Med. 2003;348(12):1170–5.

    PubMed  Article  PubMed Central  Google Scholar 

  13. 13.

    Montagu A. Man’s most dangerous myth: the fallacy of race. Lanham: Rowman & Littlefield; 1997.

    Google Scholar 

  14. 14.

    Graves JL Jr. The emperor’s new clothes: biological theories of race at the millennium. New Brunswick: Rutgers University Press; 2003.

    Google Scholar 

  15. 15.

    Saini A. Superior: the return of race science. Boston: Beacon Press; 2019.

    Google Scholar 

  16. 16.

    Lee SS, Mountain J, Koenig BA. The meanings of “race” in the new genomics: implications for health disparities research. Yale J Health Policy Law Ethics. 2001;1:33–75.

    CAS  PubMed  PubMed Central  Google Scholar 

  17. 17.

    Braun L. Reifying human difference: the debate on genetics, race, and health. Int J Health Serv. 2006;36(3):557–73.

    PubMed  Article  PubMed Central  Google Scholar 

  18. 18.

    Gannett L. The biological reification of race. Brit J Philos Sci. 2004;55(2):323–45.

    Article  Google Scholar 

  19. 19.

    Ackerman R, Athreya S, Bolnick D, Fuentes A, Lasisi T, Lee SH, et al. AAPA statement on biological aspects of race. Am J Phys Anthropol. 1996;101(4):569–70.

    Article  Google Scholar 

  20. 20.

    Graves JL Jr. Evolutionary versus racial medicine: why it matters. In: Krimsky S, Sloan K, editors. Race and the genetic revolution: science, myth, and culture. New York: Columbia University Press; 2011. p. 142–70.

    Google Scholar 

  21. 21.

    Graves JL Jr. Why the nonexistence of biological races does not mean the nonexistence of racism. Am Behav Sci. 2015;59(11):1474–95.

    Article  Google Scholar 

  22. 22.

    Graves JL Jr. Great is their sin: biological determinism in the age of genomics. Ann Am Acad Political Soc Sci. 2015;661(1):24–50.

    Article  Google Scholar 

  23. 23.

    Graves JL Jr. Biological theories of race beyond the millenium. In: Suzuki K, Von Vacano DA, editors. Reconsidering race: social science perspectives on racial categories in the age of genomics. Oxford: Oxford University Press; 2018. p. 21–31.

    Google Scholar 

  24. 24.

    Yudell M, Roberts D, DeSalle R, Tishkoff S. Taking race out of human genetics. Science. 2016;351(6273):564–5.

    CAS  PubMed  Article  PubMed Central  Google Scholar 

  25. 25.

    Nagar SD, Moreno AM, Norris ET, Rishishwar L, Conley AB, O'Neal KL, et al. Population pharmacogenomics for precision public health in Colombia. Front Genet. 2019;10:241.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  26. 26.

    Ahsan T, Urmi NJ, Sajib AA. Heterogeneity in the distribution of 159 drug-response related SNPs in world populations and their genetic relatedness. PLoS One. 2020;15(1):e0228000.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  27. 27.

    Hariprakash JM, Vellarikkal SK, Keechilat P, Verma A, Jayarajan R, Dixit V, et al. Pharmacogenetic landscape of DPYD variants in south Asian populations by integration of genome-scale data. Pharmacogenomics. 2018;19(3):227–41.

    CAS  PubMed  Article  PubMed Central  Google Scholar 

  28. 28.

    Lakiotaki K, Kanterakis A, Kartsaki E, Katsila T, Patrinos GP, Potamias G. Exploring public genomics data for population pharmacogenomics. PLoS One. 2017;12(8):e0182138.

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  29. 29.

    Bonifaz-Pena V, Contreras AV, Struchiner CJ, Roela RA, Furuya-Mazzotti TK, Chammas R, et al. Exploring the distribution of genetic markers of pharmacogenomics relevance in Brazilian and Mexican populations. PLoS One. 2014;9(11):e112640.

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  30. 30.

    Ramos E, Doumatey A, Elkahloun AG, Shriner D, Huang H, Chen G, et al. Pharmacogenomics, ancestry and clinical decision making for global populations. Pharmacogenomics J. 2014;14(3):217–22.

    CAS  PubMed  Article  Google Scholar 

  31. 31.

    Mizzi C, Dalabira E, Kumuthini J, Dzimiri N, Balogh I, Basak N, et al. A European spectrum of pharmacogenomic biomarkers: implications for clinical pharmacogenomics. PLoS One. 2016;11(9):e0162866.

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  32. 32.

    Kirzinger A, Neuman T, Cubanski J, Brodie M. Prescription drugs and older adults San Francisco: Kaiser Family Foundation; 2019 [Available from:

  33. 33.

    Bureau UC. Quick facts: United States; 2010.

    Google Scholar 

  34. 34.

    Sonnega A, Faul JD, Ofstedal MB, Langa KM, Phillips JW, Weir DR. Cohort profile: the health and retirement study (HRS). Int J Epidemiol. 2014;43(2):576–85.

    PubMed  PubMed Central  Article  Google Scholar 

  35. 35.

    Humes KR, Jones NA, Ramirez RR. Overview of Race and Hisapnic Origin Washington, DC: US Census Bureau; 2011 [Available from:

  36. 36.

    Bryc K, Durand EY, Macpherson JM, Reich D, Mountain JL. The genetic ancestry of African Americans, Latinos, and European Americans across the United States. Am J Hum Genet. 2015;96(1):37–53.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  37. 37.

    Bland JM, Altman DG. Statistics notes. The odds ratio. BMJ. 2000;320(7247):1468.

  38. 38.

    Lotsch J, von Hentig N, Freynhagen R, Griessinger N, Zimmermann M, Doehring A, et al. Cross-sectional analysis of the influence of currently known pharmacogenetic modulators on opioid therapy in outpatient pain centers. Pharmacogenet Genomics. 2009;19(6):429–36.

    PubMed  Article  CAS  PubMed Central  Google Scholar 

  39. 39.

    Uher R, Perroud N, Ng MY, Hauser J, Henigsberg N, Maier W, et al. Genome-wide pharmacogenetics of antidepressant response in the GENDEP project. Am J Psychiatry. 2010;167(5):555–64.

    PubMed  Article  PubMed Central  Google Scholar 

  40. 40.

    Tanaka S, Syu A, Ishiguro H, Inada T, Horiuchi Y, Ishikawa M, et al. DPP6 as a candidate gene for neuroleptic-induced tardive dyskinesia. Pharmacogenomics J. 2013;13(1):27–34.

    CAS  PubMed  Article  PubMed Central  Google Scholar 

  41. 41.

    Tang H, Quertermous T, Rodriguez B, Kardia SL, Zhu X, Brown A, et al. Genetic structure, self-identified race/ethnicity, and confounding in case-control association studies. Am J Hum Genet. 2005;76(2):268–75.

    CAS  PubMed  Article  PubMed Central  Google Scholar 

  42. 42.

    Witherspoon DJ, Wooding S, Rogers AR, Marchani EE, Watkins WS, Batzer MA, et al. Genetic similarities within and between human populations. Genetics. 2007;176(1):351–9.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  43. 43.

    Yuan J, Hu Z, Mahal BA, Zhao SD, Kensler KH, Pi J, et al. Integrated analysis of genetic ancestry and genomic alterations across cancers. Cancer Cell. 2018;34(4):549–60 e9.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  44. 44.

    Fang H, Hui Q, Lynch J, Honerlaw J, Assimes TL, Huang J, et al. Harmonizing genetic ancestry and self-identified race/ethnicity in genome-wide association studies. Am J Hum Genet. 2019;105(4):763–72.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  45. 45.

    Jordan IK, Rishishwar L, Conley AB. Native American admixture recapitulates population-specific migration and settlement of the continental United States. PLoS Genet. 2019;15(9):e1008225.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  46. 46.

    Bryc K, Velez C, Karafet T, Moreno-Estrada A, Reynolds A, Auton A, et al. Colloquium paper: genome-wide patterns of population structure and admixture among Hispanic/Latino populations. Proc Natl Acad Sci U S A. 2010;107(Suppl 2):8954–61.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  47. 47.

    Bryc K, Auton A, Nelson MR, Oksenberg JR, Hauser SL, Williams S, et al. Genome-wide patterns of population structure and admixture in West Africans and African Americans. Proc Natl Acad Sci U S A. 2010;107(2):786–91.

    CAS  PubMed  Article  PubMed Central  Google Scholar 

  48. 48.

    Baharian S, Barakatt M, Gignoux CR, Shringarpure S, Errington J, Blot WJ, et al. The great migration and African-American genomic diversity. PLoS Genet. 2016;12(5):e1006059.

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  49. 49.

    Wright S. Isolation by distance. Genetics. 1943;28(2):114–38.

    CAS  PubMed  PubMed Central  Google Scholar 

  50. 50.

    Cavalli-Sforza LL. The history and geography of human genes. Princeton: Princeton University Press; 1994.

    Google Scholar 

  51. 51.

    Rosenberg NA, Mahajan S, Ramachandran S, Zhao C, Pritchard JK, Feldman MW. Clines, clusters, and the effect of study design on the inference of human population structure. PLoS Genet. 2005;1(6):e70.

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  52. 52.

    Serre D, Paabo S. Evidence for gradients of human genetic diversity within and among continents. Genome Res. 2004;14(9):1679–85.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  53. 53.

    Genomes Project C, Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, et al. An integrated map of genetic variation from 1,092 human genomes. Nature. 2012;491(7422):56–65.

    Article  CAS  Google Scholar 

  54. 54.

    Genomes Project C, Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, et al. A global reference for human genetic variation. Nature. 2015;526(7571):68–74.

    Article  CAS  Google Scholar 

  55. 55.

    Rosenberg NA, Pritchard JK, Weber JL, Cann HM, Kidd KK, Zhivotovsky LA, et al. Genetic structure of human populations. Science. 2002;298(5602):2381–5.

    CAS  PubMed  Article  PubMed Central  Google Scholar 

  56. 56.

    Li JZ, Absher DM, Tang H, Southwick AM, Casto AM, Ramachandran S, et al. Worldwide human relationships inferred from genome-wide patterns of variation. Science. 2008;319(5866):1100–4.

    CAS  PubMed  Article  PubMed Central  Google Scholar 

  57. 57.

    Prugnolle F, Manica A, Balloux F. Geography predicts neutral genetic diversity of human populations. Curr Biol. 2005;15(5):R159–60.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  58. 58.

    Handley LJ, Manica A, Goudet J, Balloux F. Going the distance: human population genetics in a clinal world. Trends Genet. 2007;23(9):432–9.

    CAS  PubMed  Article  PubMed Central  Google Scholar 

  59. 59.

    Domingue BW, Fletcher J, Conley D, Boardman JD. Genetic and educational assortative mating among US adults. Proc Natl Acad Sci U S A. 2014;111(22):7996–8000.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  60. 60.

    Schwartz CR. Trends and variation in assortative mating: causes and consequences. Annu Rev Sociol. 2013;39:451–70.

    Article  Google Scholar 

  61. 61.

    Mora GC. Making Hispanics: how activists, bureaucrats, and media constructed a new American. Chicago: University of Chicago Press; 2014.

    Google Scholar 

  62. 62.

    Newbeck P. Virginia hasn’t always been for lovers: interracial marriage bans and the case of Richard and Mildred loving. Carbondale: Southern Illinois University Press; 2008.

    Google Scholar 

  63. 63.

    Wang W. The rise of intermarriage Washingtoin, DC: Pew Research Center; 2012 [Available from:

  64. 64.

    Lewontin RC. The apportionment of human diversity. In: Dobzhansky TH, Hecht MK, Steere WC, editors. Evolutionary biology. New York: Springer; 1972. p. 381–98.

    Google Scholar 

  65. 65.

    Barbujani G, Di Benedetto G. Genetic variances within and between human groups. Genes Fossils Behaviour. 2001:63–77.

  66. 66.

    Brown RA, Armelagos GJ. Apportionment of racial diversity: a review. Evol Anthropol. 2001;10(1):34–40.

    Article  Google Scholar 

  67. 67.

    Excoffier L, Hamilton G. Comment on “Genetic structure of human populations”. Science. 2003;300(5627):1877; author reply.

  68. 68.

    Long JC, Kittles RA. Human genetic diversity and the nonexistence of biological races. Hum Biol. 2003;75(4):449–71.

    PubMed  Article  PubMed Central  Google Scholar 

  69. 69.

    Ruvolo M, Seielstad M. The apportionment of human diversity: 25 years later. In: Singh RS, Krimbas CB, Paul DB, Beatty J, editors. Thinking about Evolution: Historical, Philosophical, and Political Perspectives: Cambridge: Cambridge University Press; 2001. p. 141–51.

  70. 70.

    Edwards AW. Human genetic diversity: Lewontin's fallacy. Bioessays. 2003;25(8):798–801.

    CAS  PubMed  Article  PubMed Central  Google Scholar 

  71. 71.

    Rosenberg NA. Variance-partitioning and classification in human population genetics. Phylogenetic Inference, Selection Theory, and History of Science: Selected Papers of AWF Edwards with Commentaries. Cambridge: Cambridge University Press; 2018. p. 399–403.

  72. 72.

    Altman DG, Andersen PK. Calculating the number needed to treat for trials where the outcome is time to an event. BMJ. 1999;319(7223):1492–5.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  73. 73.

    Team d. NCBI dbSNP 2019 [Available from: ChanKey=dbsnpnews.

  74. 74.

    Ng PC, Zhao Q, Levy S, Strausberg RL, Venter JC. Individual genomes instead of race for personalized medicine. Clin Pharmacol Ther. 2008;84(3):306–9.

    CAS  PubMed  Article  Google Scholar 

  75. 75.

    Abul-Husn NS, Kenny EE. Personalized medicine and the power of electronic health records. Cell. 2019;177(1):58–69.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  76. 76.

    All of Us Research Program I, Denny JC, Rutter JL, Goldstein DB, Philippakis A, Smoller JW, et al. The “All of Us” research program. N Engl J Med. 2019;381(7):668–76.

    Article  Google Scholar 

  77. 77.

    Popejoy AB, Fullerton SM. Genomics is failing on diversity. Nature. 2016;538(7624):161–4.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  78. 78.

    Petrovski S, Goldstein DB. Unequal representation of genetic variation across ancestry groups creates healthcare inequality in the application of precision medicine. Genome Biol. 2016;17(1):157.

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  79. 79.

    Reich D, Patterson N, Campbell D, Tandon A, Mazieres S, Ray N, et al. Reconstructing Native American population history. Nature. 2012;488(7411):370–4.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  80. 80.

    Chang CC, Chow CC, Tellier LC, Vattikuti S, Purcell SM, Lee JJ. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience. 2015;4:7.

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  81. 81.

    Delaneau O, Zagury JF, Marchini J. Improved whole-chromosome phasing for disease and population genetic studies. Nat Methods. 2013;10(1):5–6.

    CAS  PubMed  Article  PubMed Central  Google Scholar 

  82. 82.

    Maples BK, Gravel S, Kenny EE, Bustamante CD. RFMix: a discriminative modeling approach for rapid and robust local-ancestry inference. Am J Hum Genet. 2013;93(2):278–88.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  83. 83.

    Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine learning in Python 2011;12(Oct):2825–2830.

  84. 84.

    Hudson RR, Slatkin M, Maddison WP. Estimation of levels of gene flow from DNA sequence data. Genetics. 1992;132(2):583–9.

    CAS  PubMed  PubMed Central  Google Scholar 

Download references


The authors thank Dr. Greg Gibson for the comments on a manuscript draft and Dr. Joseph Lachance for helpful discussion.


SDN was supported by the Georgia Tech Bioinformatics Graduate Program. ABC and IKJ were supported by the IHRC-Georgia Tech Applied Bioinformatics Laboratory.

Author information




SDN designed and was responsible for all data analyses. ABC performed genetic ancestry analysis. SDN and IKJ were responsible for the interpretation of results and discussion. SDN and IKJ prepared figures and wrote the manuscript. IKJ conceived of, designed, and supervised the project. All authors read and approved the final manuscript.

Corresponding author

Correspondence to I. King Jordan.

Ethics declarations

Ethics approval and consent to participate

Not applicable

Consent for publication

Not applicable

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Additional file 1.

Table S1, Figures S1–4.

Additional file 2.

Table S2

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Nagar, S.D., Conley, A.B. & Jordan, I.K. Population structure and pharmacogenomic risk stratification in the United States. BMC Biol 18, 140 (2020).

Download citation


  • Population genomics
  • Pharmacogenomics
  • Precision public health
  • Genetic ancestry
  • Race
  • Ethnicity