KIRCLE workflow description
KIRCLE is an allele inference algorithm that uses aligned WES data in the form of a BAM or CRAM file to generate probability estimates for each KIR allele, as well as genotype predictions for each KIR gene. KIRCLE consists of 4 major steps: pre-processing, local alignment with BLAST, bootstrapped expectation-maximization, and thresholding (Fig. 1a).
-
1)
In pre-preprocessing, KIRCLE first extracts all WES reads that map to the genomic coordinates of the KIR genes on chromosome 19q13.4 and writes these reads to fifteen separate files—one for each KIR gene.
-
2)
Next, KIRCLE uses nucleotide BLAST to perform local alignment on each KIR gene’s collection of reads against a database of variants belonging to that particular KIR gene. In the IPD-KIR Database v2.8.0, 908 different alleles spanning the 15 KIR genes are documented, of which 535 represent distinct coding variants. KIRCLE then filters out alignments with less than 100% identity matches to documented KIR alleles.
-
3)
KIRCLE then bootstraps the BLAST-identified alignments with 100% identity matches to KIR alleles and uses an expectation-maximization (EM) algorithm, with convergence hyperparameter α, to generate allele probability estimates from these collections of alignments (Fig. 1c). n bootstraps of fraction p of all 100%-identity alignments are computed in this manner. The bootstrapped allele probability estimates are then averaged together to determine a final probability estimate for each allele. This bootstrapping is helpful in countering the EM algorithm’s tendency to converge to local minima representing homozygous solutions based on small differences in initial alignment data.
-
4)
Finally, KIRCLE uses a thresholding algorithm to convert each KIR gene’s set of allele probability estimates into homozygous or heterozygous genotype calls, depending on the number of alleles that exceeded a heuristically determined threshold t (Fig. 1b). Depending on the user’s selection of hyperparameter t and the EM algorithm’s outputted allele probabilities, the resulting genotype solution may be either homozygous, heterozygous, or non-existent.
Final workflow outputs from KIRCLE include a table of allele probability estimates, a table of genotype calls, and a list of runtime hyperparameters.
Hyperparameter determination and validation for KIRCLE
KIRCLE requires the use of 4 hyperparameters: α (the convergence threshold for expectation-maximization), n (the number of bootstraps to perform), p (the proportion of reads to use in each bootstrap), and t (the threshold used to convert KIR allele probabilities to binary KIR genotype calls). Of these, choices regarding p and t represent the greatest and most direct potential sources of variability in KIRCLE’s accuracy. Using one randomly selected sample from UK Biobank, we were able to characterize KIRCLE’s performance, as measured by the Shannon entropy of the inferred genotypes, across different values of p (from 0.2 to 0.8) and t (from 0.05 to 0.40). At each set of hyperparameters tested, we performed 500 iterations of KIRCLE on one arbitrarily selected sample in UK Biobank, collected the 500 genotype outputs, and empirically computed the log2 Shannon entropy of the genotype solutions for each KIR gene. An ideal genotype caller would be consistent and call the same solution for the same input, resulting in a “genotype-entropy” of 0. For many KIR genes, such as KIR2DL1 and KIR2DL4, contour maps of the resulting entropies revealed that KIRCLE was largely self-consistent, with little variability of output (genotype-entropy of 0) across a wide spectrum of hyperparameter values (Fig. 2a, b). This pattern was recapitulated in the majority of KIR genes, suggesting respectable consistency of KIRCLE output across multiple KIR genes (Supplementary Fig. S1a–j). For all subsequent analyses in this manuscript, hyperparameter values of α=1e−5, n=100, p=0.5, and t=0.25 were used.
Next, to establish the accuracy of our algorithm, we assessed the concordance of KIRCLE-generated genotypes between TCGA biological replicates. Of 10,332 exomes in TCGA, 1,062 were present twice as biological replicates and thus used in this analysis. We determined that 85.8% of genotype solutions called by KIRCLE across all KIR genes were concordant between replicates (Supplementary Fig. S2a). We defined solutions to be concordant if the genotype inferred by KIRCLE in one sample was identical to that inferred in its replicate. Genes with the highest concordance between replicates were KIR2DS2 (98.3%), KIR3DL1 (92.1%), and KIR3DS1 (92.1%), whereas genes with the lowest concordance between replicates were KIR2DL2 (77.0%), KIR2DS5 (78.9%), and KIR3DL3 (81.0%) (Fig. 2c). Given the observed difference between KIRs with the highest concordance (e.g., KIR2DS2) and those with the lowest concordance (e.g., KIR2DL2), we sought to explain these differences in accuracy by analyzing the degree of sequence similarity between different alleles of each KIR. We assessed sequence similarity by performing a multiple sequence alignment (MSA) on all alleles of a KIR using Clustal Omega [13] and then measuring the mean phylogram distance between all coding variants (Supplementary Fig. S2b–c). We noted a strong positive correlation (Spearman’s ρ=0.631) between the mean distance between alleles of a given KIR and the observed concordance between that KIR’s genotype calls among TCGA biological replicates, indicating that the higher levels of sequence similarity among alleles of KIRs such as KIR2DL2, KIR2DS5, and KIR3DL3 are likely to account for their lower rates of observed concordance.
Finally, we investigated whether KIRCLE is robust against differences in depth of sequencing. To do so, we compared the ambiguity of KIRCLE’s output, quantified as the Shannon entropy of generated KIR allele probabilities, across TCGA samples with different depths of sequencing. Low ambiguity in KIR allele calling results in KIR allele probabilities of either 1 for a single allele and 0 for all other alleles (reflecting a homozygous genotype) or 0.5 for two different alleles and 0 for all other alleles (reflecting a heterozygous genotype), leading to entropies of either 0 or 1 respectively. Conversely, high ambiguity in KIR allele calling will lead to a more uniform distribution of KIR allele probabilities, leading to entropies higher than 1. For each TCGA sample, we measured both the average coverage and the KIR-allele-probability entropies for each KIR gene. Binning samples by their average coverages, we observed that allele probability entropies—and thus the ambiguity of KIR allele probabilities—are notably increased only at very low coverages (<20 average depth of coverage at the KIR gene locus). Furthermore, as negative controls, 20 “pseudo-BAMs” were generated by randomly sampling reads mapping to KIR gene loci from 50 randomly selected BAMs in TCGA. Pseudo-BAMs were generated with an average read depth commensurate with their constituent BAMs. After applying KIRCLE to these pseudo-BAMs, their resulting allele probability entropies were much higher (median=1.70; IQR 0.958–2.61) than a significant majority of actually observed entropies for all TCGA samples, regardless of the depth of coverage (Fig. 2d). Moreover, despite differences in average depth of coverage at different KIR gene loci, average KIR allele entropies between different KIR genes largely remained constant (Supplementary Fig. S2b). Overall, KIRCLE demonstrated a high level of consistency while being able to call a diverse set of KIR genotype solutions and is robust to the effects of low depth of coverage.
KIR allele comparisons between TCGA and UK Biobank
After benchmarking KIRCLE using internal quality control metrics, we assessed KIRCLE’s performance by comparing its allele predictions in TCGA to its allele predictions in UK Biobank. We first compared the frequencies of different KIR alleles in TCGA with their frequencies in UK Biobank among Caucasian individuals in both datasets, in order to mitigate race as a confounding variable affecting observed allele frequency. For each KIR gene, we ranked its alleles by frequency in both TCGA and UK Biobank and then computed the Spearman correlation coefficient between the allele frequencies in the two datasets (Fig. 3a). We noted that all KIR genes displayed positive correlation coefficients and that the vast majority of KIR genes demonstrated highly similar distributions of allele frequencies between TCGA and UK Biobank (Spearman’s ρ=0.802). Direct comparison of all KIR alleles ranked by frequency also demonstrated high consistency between the two cohorts’ Caucasian subpopulations (Fig. 3b). Both UK Biobank and TCGA are largely composed of Caucasians (81.4% and 93.3% of the individuals analyzed in TCGA and UK Biobank respectively).
Additionally, we were able to further validate observed KIR allele frequencies for certain KIR genes using allele frequency data from the US National Marrow Donor Program (NMDP), as reported by the Allele Frequency Net Database [14]. We used the NMDP dataset because the subjects in this cohort are predominantly Caucasians, similar to the TCGA patients. For KIR2DL4, the four most frequent KIR2DL4 alleles reported by the NMDP were KIR2DL4*001, KIR2DL4*008, KIR2DL4*005, and KIR2DL4*011 (34.6%, 30.2%, 19.9%, and 11.6% respectively). We were able to recapitulate these four alleles as the most frequent KIR2DL4 alleles in both TCGA and UK Biobank’s Caucasian subpopulations, albeit in a different order for each dataset (Fig. 3c). In TCGA, KIR2DL4*005 was the most frequent allele, followed by KIR2DL4*001, KIR2DL4*008, and KIR2DL4*011 (44.4%, 23.5%, 14.4%, and 7.55%). In UK Biobank, this order was reversed with KIR2DL4*011 being the most frequent allele, followed by KIR2DL4*008, KIR2DL4*001, and finally KIR2DL4*005 (32.4%, 21.0%, 18.0%, and 13.9%). Further validation of allele frequencies against the NMDP was also performed for the alleles of KIR3DL2. The most frequent KIR3DL2 allele in a population of 75 Caucasians was KIR3DL2*002 (26.1%), followed by KIR3DL2*001 and KIR3DL2*007 (21.0% and 18.8%) [15]. While KIR3DL2*002 was found at similarly high frequencies in both TCGA (9.37%) and UK Biobank (9.27%) as the 4th and 3rd most frequent KIR3DL2 alleles respectively, KIR3DL2*001 and KIR3DL2*007 were much lower ranked at 9th and 13th in TCGA and 8th and 1st in UK Biobank respectively. However, these are still fairly well-represented alleles at 4.95% and 2.78% frequency in TCGA and 5.00% and 13.3% frequency in UK biobank respectively. Furthermore, considered overall, KIR3DL2 allele frequency ranks in TCGA and UK Biobank still demonstrate positive correlations with the allele frequency ranks observed in the NMDP (Supplementary Fig. S2c). Despite slight numerical differences, confirmation of the status of the most frequent alleles in these two KIR genes increases our confidence in KIRCLE’s ability to infer KIR alleles from WES data accurately.
In addition to validating population frequencies of KIR alleles, we also examined patterns of KIR allele co-expression and dependence. As KIRCLE assesses for the presence of 535 KIR alleles over 15 KIR genes, the KIR genotype of each sample in TCGA and UK Biobank may be represented as a point in 535-dimensional “KIR-space.” We first used t-distributed stochastic neighbor embedding (t-SNE) to perform dimensionality reduction and thus visualize the distribution of individuals in TCGA in 2 dimensions [16]. When we colored this t-SNE map using individuals’ SNP-inferred ethnicities [17], we observed that different ethnicities cluster together and are non-uniformly distributed (Fig. 3d). In particular, African Americans and—to a lesser extent—Asian Americans in TCGA formed clusters that were often very distinct from the Caucasian majority. Similar analyses performed in UK Biobank recapitulated this non-random distribution of KIR genotypes and confirmed the non-uniform distribution and clustering of those who self-identified their ethnicity as “Black” or “Asian” (Fig. 3e). Of particular note, the “Asian” population in TCGA comprises those of East Asian descent, whereas the “Asian” population in UK Biobank largely comprises those of South Asian descent (with major subcategories of Indians, Pakistanis, and Bangladeshis). However, both groups of Asians clustered distinctly and separately from the Caucasian majority to some extent in both datasets.
KIR allele associations with HLA alleles
As it is known that HLA and KIR bind to each other in an allele-specific way, we posited that strong correlations may also exist between KIR alleles and HLA alleles on the population level, due to a known co-evolution event in humans [18]. Using HLA types imputed by the HLA*IMP:02 algorithm and subsequently released by UK Biobank [19], we observed 326 significantly associated pairs of KIR alleles with HLA alleles in the UK Biobank data (Fig. 4a, b, Supplementary Fig. S3a). Many of these associations belonged to a set of particularly common HLA alleles (e.g., HLA-B*53) or KIR alleles (e.g., KIR3DL3*005). Furthermore, we also note that the majority (73.0%) of significant associations are positive. We speculate that these associations reflect changes in direct physical interactions between HLA and KIR alleles, which result in co-selection due to an advantageous increase in fitness for individuals with these combinations of KIR and HLA alleles. Particularly visually striking examples of positive and negative associations between KIRs and HLAs include KIR3DL3*005 with HLA-A*74 (Fisher’s exact FDR=7.43e−43; odds ratio=55.1) and KIR2DL3*002 with HLA-A*36 (Fisher’s exact FDR=1.71e−12; odds ratio=0.0727), respectively (Fig. 4c). Additionally, when examining the t-SNE coordinates of individuals with HLA alleles such as HLA-B*42, we observed a non-uniform distribution and clustering of these samples that closely mirrors the distribution of samples when labeled by ethnicity (Fig. 4d).
While these findings may support the biological link between these two classes of molecules and shed additional light onto which particular HLA alleles may have evolved in parallel with particular KIR alleles, they also raise the possibility that our observed associations are driven by population stratification according to ethnicity. In order to disentangle the effects of this stratification on associations between HLA and KIR alleles, we re-attempted the analysis using only Caucasian individuals in UK Biobank, while testing only KIR alleles with >1% allele frequency in UK Biobank (Fig. 4e). While this analysis unveiled a much smaller subset of HLA-KIR associations, we noted 3 significant associations: HLA-C*17 with KIR2DS4*016 and HLA-B*41 with KIR2DL4*011 and KIR2DS4*016. Notably, both KIR2DS4 and KIR2DL4 have NK-cell-activating activity, and all three are affiliated with a negative odds ratio. These results indicate that HLA-C*17 and B*41 could be true activation ligands for KIR2DS4 and KIR2DL4, and their interactions may induce NK responses that impose negative selection pressure on individuals bearing both alleles.
Although TCGA is a much smaller dataset than UK Biobank, we were able to use TCGA to discover a smaller set of correlations between HLA alleles and KIR alleles after filtering out KIR alleles with <1% allele frequency in TCGA to improve our Bonferroni correction factor (Supplementary Fig. S3b). HLA allele calls for samples in TCGA were made using POLYSOLVER [20]. In particular, KIR2DL2*003, KIR3DL2*013, and KIR3DL3*008 were strongly positively associated with HLA-B*46, HLA-B*53, and HLA-C*15 respectively at the FDR < 0.25 level. The HLA-B*53 association with KIR3DL2*013, notably, was the most significant HLA-KIR association discovered in UK Biobank. However, when we re-attempted the analysis using only Caucasian individuals in TCGA to eliminate population stratification by ethnicity as a potential confounding factor, all significant associations between KIR and HLA alleles disappeared after Bonferroni correction. In summary, after correction of population stratifications, we found few significant associations between activating the KIR gene and HLA alleles. The absence of significant associations between inhibitory KIR genes and HLA alleles might suggest weaker selective pressure for KIR alleles, possibly due to the multiple redundant mechanisms inhibiting NK cell activation [21].
KIR allele associations with clinical correlates
In addition to correlations with HLA alleles, we searched for KIR allele correlations with clinical features. We first examined KIR allele correlations with individuals’ medical diagnoses documented in UK Biobank, as encoded by the 10th revision of the International Statistical Classification of Diseases (ICD10). To minimize the number of under-powered tests we performed, we attempted correlations only with KIR alleles represented at over 1% frequency in UK Biobank. Additionally, we excluded all diseases primarily associated with external causes, including accidents, injuries, and nutritional deficiencies, as well as obstetric and psychiatric diseases among others. Of note, this list of exclusions includes infectious diseases, which despite having a strong biological basis for association with KIR alleles, require exposure to a pathogen, which is largely driven by individuals’ environmental circumstances. Strikingly, the only associations that remained significant at the FDR < 0.25 level were those associated with sickle-cell anemia (ICD10 D57) or with uterine leiomyomas (ICD10 D25), both diseases that disproportionately affect black people [22]. However, positing a direct biological mechanism behind these associations likely would represent a third-cause fallacy, as blacks are statistically more likely to possess both KIR alleles enriched in black populations as well as either the sickle-cell trait or uterine leiomyomas.
Thus, we next narrowed our analysis to investigate only those individuals who self-identified as Caucasian. While the vast majority of correlations failed false-discovery-rate correction, we discovered a significant correlation between the KIR3DL3*080 allele and ICD10 K25—peptic ulcer disease (PUD) (Fisher’s exact test, FDR= 0.0429; Fig. 5a). Whereas those without KIR3DL3*080 had merely a 1.04% chance of being diagnosed with PUD, patients with KIR3DL3*080 had a 2.90% chance of being diagnosed with PUD, representing a 2.8-fold increase in likelihood (Fig. 5b). No significant association was found between KIR3DL3*080 and usage of ibuprofen, which would predispose individuals toward developing PUD (data not shown). Thus, if KIR3DL3*080 predisposes an individual toward PUD, it likely does so through an alternative mechanism. Significant correlations with ICD10 diagnosis codes in Black and Asian populations were not observed, likely owing to the lower statistical power these smaller populations had.
Additionally, we explored correlations between KIR alleles with population frequency >1% and other clinical correlates besides ICD10 codes. When examining correlations with age of onset of several chronic diseases and conditions, we discovered that KIR3DL2*107 was highly correlated with early age of onset of hay fever, rhinitis, or eczema in Caucasian individuals. Whereas individuals without KIR3DL2*107 had an average age of onset of 24.7 years (IQR 12–35 years), those with at least one copy of KIR3DL2*107 had an average age of onset of 22.4 years (IQR 11–30 years; two-sided Mann-Whitney FDR=0.0751; Fig. 5c). Moreover, an alternative allele of KIR3DL2, KIR3DL2*062, was weakly associated with an increase in age of onset of hay fever, rhinitis, or eczema from 24.5 years (IQR 12–35 years) to 27.0 years (IQR 14–40 years; Mann-Whitney FDR=0.244; Fig. 5d). Later onset of these conditions was particularly pronounced in individuals with two copies of KIR3DL2*062, with an average age of onset of 27.6 years (IQR 14–40 years). Together, these results suggest that polymorphisms in KIR3DL2 may play a key role in determining the age of onset of hay fever, rhinitis, and/or eczema.
Moreover, hay fever and eczema, in conjunction with allergic asthma, more broadly represent manifestations of atopy, the genetic predilection to trigger IgE-mediated (type I) hypersensitivity reactions following allergen exposure with increased TH2-driven responses [23]. Thus, we next attempted to generalize this association to encompass atopy more broadly by examining KIR3DL2*107 and KIR3DL2*062’s associations with age of onset of either asthma or hay fever, rhinitis, or eczema, using the age of onset of whichever condition occurred earliest in life for each individual. We observed the same association: individuals with at least one copy of KIR3DL2*107 had an average age of onset of 15.9 years (IQR 6–20.75 years), whereas those without any copies of KIR3DL2*107 had an average age of onset of 19.0 years (IQR 8–28 years; Mann-Whitney p=0.012; Fig. 5e). Simultaneously, individuals with at least one copy of KIR3DL2*062 (22.5 years; IQR 10.25–31.5 years), and particularly those with two copies of KIR3DL2*062 (23.1 years; IQR 10.75–32.75 years), had later onsets of atopic reactions than those without KIR3DL2*062 (18.7 years; IQR 7–27 years; Mann-Whitney p=0.019; Fig. 5f). Together, these findings suggest a potential biological mechanism either delaying or hastening onset of atopic reactions like hay fever, eczema, or asthma that involves KIR3DL2, and the KIR3DL2*107 and KIR3DL2*062 alleles in particular. In addition to atopic reactions, we also observed significant associations of KIR alleles with other clinical correlates, including dental and oral health, quantitative blood analysis, and waist circumference, suggesting potentially broad impact of natural killer functions in affecting diverse human traits (Supplementary Fig. 4).
Finally, to further explore the effects of KIR3DL2 polymorphism on age of atopy onset, we posited that each of the two aforementioned KIR3DL2 alleles follows either a dominant, semi-dominant, or recessive model of expression and then sought to determine which of these three models best explains the effect of KIR3DL2 genotype on age of atopy onset. In the recessive model, only a genotype homozygous for the KIR3DL2 allele in question contributes to a change in age of atopy onset from baseline. In contrast, in the dominant model, genotypes either homozygous or heterozygous for the KIR3DL2 allele in question contribute to changes in baseline age of atopy onset. Finally, in the semi-dominant model, homozygotes for the KIR3DL2 allele in question are twice as potent as corresponding heterozygotes in changing age of atopy onset from baseline. When assessed against each other using the UK Biobank data, the dominant model outperformed semi-dominant and recessive models of expression for KIR3DL2*107, as measured by the Bayesian information criterion (−10.8145 versus −10.8151 and −10.8172, respectively). Meanwhile, expression patterns of KIR3DL2*062 instead favored the recessive model over the semi-dominant and dominant models of expression for KIR3DL2*062, as measured by the Bayesian information criterion (−10.8138 versus −10.8141 and −10.8146, respectively). In summary, our analysis indicated that KIR3DL2*107 may “override” other alleles and thus present with a dominant phenotype, whereas KIR3DL2*062 may be weaker than other KIR3DL2 alleles and thus present with a recessive phenotype.