Viral diversity is an obligate consideration in CRISPR/Cas9 designs for targeting the HIV reservoir

Background RNA-guided CRISPR/Cas9 systems can be designed to mutate or excise the integrated HIV genome from latently infected cells and have therefore been proposed as a curative approach for HIV. However, most studies to date have focused on molecular clones with ideal target site recognition and do not account for target site variability observed within and between patients. For clinical success and broad applicability, guide RNA (gRNA) selection must account for circulating strain diversity and incorporate the within-host diversity of HIV. Results We identified a set of gRNAs targeting HIV LTR, gag, and pol using publicly available sequences for these genes and ranked gRNAs according to global conservation across HIV-1 group M and within subtypes A–C. By considering paired and triplet combinations of gRNAs, we found triplet sets of target sites such that at least one of the gRNAs in the set was present in over 98% of all globally available sequences. We then selected 59 gRNAs from our list of highly conserved LTR target sites and evaluated in vitro activity using a loss-of-function LTR-GFP fusion reporter. We achieved efficient GFP knockdown with multiple gRNAs and found clustering of highly active gRNA target sites near the middle of the LTR. Using published deep-sequence data from HIV-infected patients, we found that globally conserved sites also had greater within-host target conservation. Lastly, we developed a mathematical model based on varying distributions of within-host HIV sequence diversity and enzyme efficacy. We used the model to estimate the number of doses required to deplete the latent reservoir and achieve functional cure thresholds. Our modeling results highlight the importance of within-host target site conservation. While increased doses may overcome low target cleavage efficiency, inadequate targeting of rare strains is predicted to lead to rebound upon cART cessation even with many doses. Conclusions Target site selection must account for global and within host viral genetic diversity. Globally conserved target sites are good starting points for design, but multiplexing is essential for depleting quasispecies and preventing viral load rebound upon therapy cessation. Electronic supplementary material The online version of this article (10.1186/s12915-018-0544-1) contains supplementary material, which is available to authorized users.


Background
Despite the success of combination antiretroviral therapy (cART) in suppressing HIV viremia, reservoirs of latently infected cells remain the major barrier for HIV cure [1]. The HIV latent reservoir is composed of long-lived infected cells harboring replication-competent proviruses with limited transcription that can reactivate and reseed the reservoir upon cART interruption [2,3]. A promising therapeutic strategy for achieving cure involves depleting the reservoir by direct disruption of proviral genomes using engineered DNA-editing enzymes such as CRISPR/Cas9 nucleases. A growing body of research shows that endonuclease-induced mutation of essential viral genes or excision of provirus can render the virus unable to replicate [4][5][6][7][8][9][10][11][12]. If performed on a large scale, this approach could yield pharmacologically significant reservoir reduction. However, viral reservoirs are highly diverse, even in well-suppressed individuals [13,14], and this diversity remains a major challenge for the application of genome editing strategies towards an HIV cure. Effective targeting of all viral genetic variants within an infected individual will be crucial for achieving sufficient reservoir reduction to prevent viral rebound upon cART cessation [15,16] and preventing the emergence of resistance to this therapy [11].
Thus far, studies used to demonstrate the viability of gene-editing strategies against HIV have primarily targeted single molecular clones that provide ideal endonuclease target site recognition [7,8]. Multiple classes of gene-editing enzymes have been studied, but the CRISPR/ Cas9 system has gained popularity in recent years due to its effectiveness, relative simplicity, and ease of use. Several computational tools now exist to identify CRISPR target sites, to predict the activity of guide RNAs (gRNAs) targeting those sites, and to identify and score gRNAs based on multiple factors including predicted off-target activity [17][18][19]. However, no available tools allow guide selection based on predicted target site conservation or predicted clinical efficacy based on viral diversity. The identification and characterization of the most conserved target sites on a group-or subtype-specific basis will allow rapid selection of gRNAs when deep sequencing of a patient's reservoir is not practical or feasible. Furthermore, because the virus can evolve resistance to endonuclease targeting [11], multiple sites may need to be targeted concurrently in order to prevent the emergence of resistance. Therefore, the selection of multiplexed sets of gRNAs must account for the diversity of circulating strains across a wide range of infected people, and dosing strategies must consider within-host diversity of HIV to maximize the probability of a functional cure.
Here, we present a CRISPR gRNA design strategy that selects target sites not only by predicted efficacy and specificity but also by prevalence in the population. We first created a database of highly conserved target sites in HIV LTR, gag, and pol focusing on group-and subtype-level conservation using information about the global sequence diversity of HIV. We used this database to identify highly conserved target site pairs and triplets to create multiplex gRNA designs predicted to maximize targeting and reduce the probability of treatment resistance. From this analysis, we identified and tested 59 LTR guides using a fluorescent reporter to quantify activity in vitro. We then used deep-sequence data from HIV-infected individuals to determine within-host target site conservation and probability of cleavage by individual gRNAs in our list. Finally, we used a mathematical model to predict the number of doses that would be required to achieve functional cure thresholds, while accounting for varying levels of target site diversity and enzyme efficacy.

Results
Broadly targeting spCas9 gRNAs against HIV gag, pol, and LTR We performed a screen to identify globally conserved target sites for Streptococcus pyogenes (spCas9) in LTR, gag, and pol using alignments for these regions obtained from the HIV LANL database. LTR was chosen for its utility in excision of the provirus [8,20,21], while gag and pol were chosen based on their conservation between HIV strains [22]. The publicly available LANL alignments contain HIV sequences from thousands of infected persons (from about 1200 for LTR to more than 8000 for pol) and include strain and geographic information. From these alignments, we computed majority consensus sequences for LTR, gag, and pol of HIV-1 group M and subtypes A-C. We identified a total of 246 unique gRNA target sites in LTR, 573 in gag, and 897 in pol. For each target site identified, we determined the number of exact hits in the overall alignment of all group M sequences and for each subtype and ranked target sites by overall prevalence (Fig. 1). Target sites were found to be most conserved in pol (Table 1), where a single target site was present in up to 86.5% (n = 4416) of all group M sequences. The most conserved target sites in LTR and gag occurred in up to 70.6% (n = 1216) and 71.1% (n = 8435), respectively, of group M sequences.
We determined predicted on-target cleavage efficiency and off-target activity for each guide sequence (Fig. 2) using the sgRNA designer tool [17]. Predicted on-target activity scores were in the range [0,1] where a score of 1 was associated with successful knockout in the experiments of Doench et al. [17,23] and gRNAs with scores < 0.2 were generally excluded because they were shown to be predictive of poor activity. Mean predicted activity scores across all identified guides were 0.50 (SD 0.12, n = 246) for LTR, 0.49 (SD 0.13, n = 573) for gag, and 0.47 (SD 0.13, n = 897) for pol. From the list of gRNAs identified, we excluded 10 from gag and 26 from pol from further analyses due to high predicted off-target activity scores. No significant correlation was observed between predicted activity and target site conservation (Additional file 1: Table S1A).

Multiplexed gRNA designs
For each gene, we determined the number of sequences that could be targeted by pairs and triplets of gRNAs in group M overall, and in each subtypes A-C (Table 1). We determined that just two strategically selected a b c Fig. 1 Top 100 gRNA target sites in HIV LTR (a), gag (b), and pol (c) ranked by prevalence (bottom to top) within an alignment of available sequences within group M for each genomic region. The x-axis shows the percentage of all sequences in group M that contain an exact match to the target site. Within each horizontal bar, shading indicates what percentage of sequences with target sites hits belong to each subtype. Inset bar plots show the total number of sequences of each subtype in the alignment  Fig. 2 a Histogram of predicted activity of all gRNAs identified in LTR, gag, and pol across all four consensus sequences (group M, subtypes A-C) for each gene. b Predicted activity score vs. target site conservation for individual gRNAs grouped by subtype and gene. Red triangles indicate gRNAs excluded due to predicted off-target activity. Numbers in blue represent the total number of guides with predicted activity score > 0.2 and where target sites occur in more than 50% of sequences in the group or subtype alignment gRNAs are sufficient for targeting 100% of LTR and pol sequences in the current global alignment for subtype A, and three gRNAs are able to target over 98% of all sequences in subtypes A-C. However, when considering all group M sequences, the maximum percentage of sequences targeted by triplet sets of gRNAs drops to 88.8% for LTR, 95.5% for gag, and 99.2% for pol (Table 1 and Additional file 1: Table S2). The two most conserved LTR sites in the whole of group M (ranks 1 and 2) were also the most prevalent target sites in the individual subtypes, but this was not the case for gag and pol (Additional file 1: Table S2).
Overall, better coverage of group M or subtypes A-C sequences was achieved when pair or triplet gRNAs targeted pol, suggesting that pol is an ideal therapeutic target for targeted mutagenesis with multiplexed guide RNAs. We determined that a minimum set of eight gRNA target sites would be required to guarantee that every pol sequence in the group M global was targeted at least once.

Functional testing of selected gRNAs
From our list of 246 gRNAs targeting LTR, we identified 59 gRNAs for functional testing by first considering the most conserved target sites in group M and each subtype. We then included any gRNAs that increase the number of sequences targeted when combined in pairs or triplets with the previous list (Additional file 2: Figure S1A). In order to test the activity of these guides in vitro, we designed LTR-GFP fusion reporter constructs using consensus sequences for group M and subtypes A-C (Fig. 3a, Additional file 2: Figure S1B). We tested the ability of each gRNA to knock down reporter GFP expression in HEK293 cells following co-transfection with a plasmid expressing spCas9 mCherry containing each HIV-specific gRNA and the LTR-GFP fusion reporter. The activity of each gRNA was measured in terms of percent knockdown of median GFP fluorescence intensity relative to negative controls at 24 h post-transfection in Cas9 expressing (mCherry positive, Additional file 2: Figure S1C) cells.
We compared measured gRNA activity to predicted activity scores from the sgRNA designer (Fig. 3b); there was a trend towards weak positive correlation between predicted and measured activity (Pearson's r = 0.25, n = 59, 95% CI = 0.00-0.48, Additional file 1: Table S1B). We observed a reduction of GFP fluorescence intensity with 52 out of 59 gRNAs (Fig. 3c, Additional file 1: Table S4), with a maximum knockdown of 76.3% (mean = 15.3%, SD = 16.0%, n = 59). Maximum knockdown was achieved at target site CAAAGACTGCTGACACAGAAGGG, which was identified in the consensus sequence of subtype C and found to occur in 23.1% of group M sequences and 68.4% of subtype C sequences in the 2016 LANL alignment. We observed clustering of the most active guides within the LTR; target sites for gRNAs with GFP knockdown > 30% were found at positions 74-75, 319-344, and 446 relative to the start of the 5′ LTR. Although some active guides appear to coincide with regions of high-residue conservation within the LTR (Fig. 3c), we found no significant correlation between GFP knockdown and target site prevalence within all available sequences in Group M (Pearson's r = − 0.03, n = 59, 95% CI = − 0.28-0.23, Additional file 1: Table S1C). a c b Fig. 3 a LTR-GFP fusion reporter to test gRNAs for activity in vitro. b Activity was measured in terms of percent knockdown of median GFP fluorescence intensity relative to negative controls. We found positive but statistically non-significant correlation between computationally predicted activity scores and measured activity. c We achieved reduction of GFP fluorescence intensity (positive activity) with a majority of gRNA designs and observed clustering of tested target sites in two areas of the LTR with the most active guides being clustered around the center of the LTR. With a small number of gRNAs, we observed negative activity (increase in GFP fluorescence). Lower panel shows residue conservation (in 0-2 bits) across the LTR for alignments of subtype sequences or all sequences in group M

In silico testing of candidate gRNAs on within-host patient sequences
In order to simulate the application of this gene-editing approach on a diverse within-host virus population, we used a published dataset of HIV sequences obtained from HIV-infected blood donors in Brazil [24], focusing on the pol gene (because it is the most highly conserved) for 10 patients. We started with our list of all pol target sites that we identified above from group and subtype consensus sequences from 2016 LANL alignments, labeling each target site according to the consensus sequence it was identified from (300, 317, 304, and 328 target sites from group M and subtype A-C consensus sequences, respectively, 1249 sites total, 897 unique sites). From this combined list of globally conserved target sites, we determined whether each site was present in each patient's HIV consensus sequence (Additional file 1: Tables S5 and S6) [24]. Across infected persons, an average of 89.4 group M target sites (i.e., 29.80% of all group M sites identified) and 119.9 subtype B sites (39.44% of all subtype B target sites identified) were found to be also present within patient consensus sequences (SD 11.14 sites/3.24% and 9.84 sites/3.71%, respectively, n = 10 patients), while subtype A and C sites were identified less frequently (Fig. 4a). Since subtype B is highly prevalent in Brazil, this was not surprising. Five target sites were found to be present in all 10 patient consensus sequences (Additional file 1: Table S6), and one of these (GATGGCAGGTGATGATTGTGTGG) was also highly conserved in the global alignment for subtype B (present in 87.09% of LANL sequences). These five target sites were found to occur between positions 2294 and 2981 in pol. In addition, we identified gRNA target sites directly from the patient's consensus sequence. The number of directly identified sites for each patient ranged between 276 and 313 (mean = 299.30, SD = 10.83, n = 10). Out of 1712 unique sites generated from the 10 patients' consensus sequences, 351 were present in our list of globally conserved sites.
Of the remaining sites, 1135 were only present in a single individual and 87 sites were found in more than 5 individuals. With one exception (GTTTCTTGCCCTGT CTCTGCTGG), every site that was present in more than 5 individuals was also present in our global list. Next, we used deep-sequence data from each of these individuals [24] to determine the degree of conservation of each target site within the patient's virus quasispecies population. In order to accurately quantify rare target site variants, we identified 4 out of 10 patient datasets where mean coverage across all identified target sites was above 5000× (Additional file 1: Table S2, Additional file 3: Figure S2B). For each of these patients, we determined within-host target site conservation by computing the percentage of reads in the alignment containing an exact match to the site. Within-host target site conservation was found to vary dramatically for individual gRNAs and between individual patients, ranging between 5.5 and 95.6% with a mean of 83.5% (SD 14.3%, n = 2298) (Fig. 4b).
Within-host target site conservation was an average of 3.4% higher for sites identified from our global list (range of means = 84.7-86.5%, n = 4 patients) compared to sites that were only present in the patient's sequence (mean = 81.6%, n = 4, p = 0.026), but the difference between groups was not statistically significant (F test, p = 0.15). Target sites identified from group M or subtype B consensus sequences tended to be more conserved than sites identified from the patient sequence, but the differences were not statistically significant (both 3.7% higher, with p = 0.087 and p = 0.054, respectively). Within-host a b Fig. 4 a Number of previously identified target sites from global consensus sequences of group M and subtypes A-C that were present in each patient's HIV consensus sequence. b Within-host target site conservation for each identified target site using deep-sequence data for 4 patients, summarized using box plots. Black dots indicate outlier target sites (outside 1.5 × IQR), and target sites are grouped and colored according to which consensus sequence they were identified from (the group-or subtype-level consensus from LANL alignments, or from the patient's HIV consensus sequence) target site conservation was nearly identical using group M or subtype B sites (p = 0.98). All p values were > 0.1 after multiple test corrections.

Modeling reservoir depletion with CRISPR-based therapy
We developed a mathematical model to understand the effect of experimentally controllable parameters on reservoir depletion with hypothetical weekly dosing of various candidate CRISPR/Cas9 therapies targeting HIV. The model simulates the clearance of the latent reservoir by including many (up to 10 4 ) quasispecies carrying replication-competent DNA. These species are unevenly abundant and are assumed to follow a log-normal distribution so that each quasispecies contains 1-1000 members. Further, each quasispecies is cleared from the reservoir so that the total reservoir clearance rate recapitulates the experimentally measured reservoir half-life of 3-4 years [25,26]. In the absence of CRISPR therapy, the model simulates a fluctuating but, on average, slowly decaying HIV reservoir with varying compositions [27]. We then simulated reservoir clearance with varying enzyme efficacy (ϵ, the probability of successful mutagenic DNA cleavage at the target site) and varying coverage proportion (ρ, the proportion of sequences that would respond to enzyme). The measure of target site conservation was based on our analysis of patient samples. Parameter ranges for ϵ were based on ranges of predicted cleavage efficiency from the sgRNA designer tool (Fig. 2) and measured activity (Fig. 3) described above. Including CRISPR, our simulations suggest that treatments with gRNAs targeting a single site will be insufficient to achieve functional cure even at high levels of target site conservation and enzyme efficacy (Fig. 5a, Additional file 4: Figure S3). Enzyme efficacy is relatively unimportant in this case, only affecting the number of treatments needed to remove the sensitive quasispecies. Once removed, additional treatments provide no additional benefit and insensitive quasispecies dominate the reservoir (Fig. 5a). However, if 100% coverage of all quasispecies can be achieved through the selection of a multiplexed set of gRNAs that can be delivered simultaneously, the number of treatments to deplete the reservoir to the first cure threshold (100-fold decrease [16]) can be achieved in 1-5 treatments depending on efficacy (Fig. 5c), whereas the second threshold (2000-fold decrease [15]) may require 5-10 treatments depending on efficacy. For all modeled assumptions, coverage is vital to reservoir depletion. Whereas suboptimal efficiency can be surmounted by repeated doses, the diversity of the reservoir constitutes the largest barrier to depletion.

Discussion
Gene editing using CRISPR/Cas9 has the potential to effect a functional cure for HIV through targeted mutagenesis or proviral genome excision [28]. This approach has now been demonstrated in multiple proof-of-concept in vitro and in vivo studies [7, 9-12, 20, 29, 30]. While laboratory demonstration of gRNA activity has largely relied on clonal populations of lab-adapted HIV strains, clinical applications of this method will need to consider the wide intra-and inter-host diversity of HIV. The global diversity of HIV-1 is reflected in the classification of viruses into four broad groups (M, N, O, and P) that are 25-40% a b Fig. 5 Simulated reservoir depletion with anti-HIV CRISPR therapy. a Example simulation based on predicted target site conservation ("potency," ρ = 0.5) and enzyme efficacy to each target site (ϵ = 0.5). CRISPR therapy is dosed weekly, and the average strain contains 100 infected cells (μ s = 100). Thin colored lines represent single strains, L s (t), and the thick black line represents the total reservoir, L(t) = ∑ s L s (t). Strains targeted by CRISPR are cleared rapidly, but untargeted strains remain unaffected and the total reservoir size does not decrease below estimated depletion thresholds for functional cure. The dashed line represents a stringent threshold for latent reservoir reduction where patients are expected to remain suppressed for years without cART [15,16]. See Additional file 4: Figure S3 for simulations varying all parameters. b If 100% coverage (ρ = 1) of target sites can be achieved (either through multiplexing of targets or due to a target site that is highly conserved), enzyme efficacy becomes relevant, dictating the number of doses to cure. At or better than predicted efficacy ϵ > 0.5, doses range between 1 and 5 doses for a median 1 year remission and 5-10 doses for a potentially lifelong absence of viral rebound based on previously estimated thresholds. However, even for 100% coverage, efficacy at 10% or less per dose requires substantial dosing (> 30) to achieve thresholds divergent, and within-group subtypes that are up to 15% divergent [22]. This remarkable global diversity of HIV is the result of within-host evolution and adaption to immune pressure, and transmission of genetic variants from the host quasispecies over multiple rounds of viral replication. Target sites chosen for gene editing will therefore also need to reflect this genetic variability within and between individuals.
Globally conserved target sites are good starting points for gRNA design; if their high frequencies in the population are the result of selection, endonuclease-induced mutations are more likely to be highly deleterious to the virus. Indeed, it has been shown that highly conserved target sites are associated with improved antiviral activity and, importantly, delayed viral escape [10,29]. Identification of sites that are conserved at a global or subtype level may also allow for future deployment of these therapies in situations where obtaining individual patient HIV sequence data may not be feasible or practical. To this end, we identified gRNA target sites in HIV LTR that were highly conserved in global consensus sequences and tested the activity of these guides in vitro. Using a separate set of deep-sequence data [24], we showed that sites identified from our list of globally conserved targets that were present in the patient's sequence also showed greater within-host conservation. For computational efficiency, our approach looks for exact matches, but future enhancements could incorporate position-dependent penalties to account for the ability of Cas9 to bind in the presence of mismatches to the target site.
The experimental setup used to test candidate gRNAs was designed to allow us to compare gRNAs against each other while minimizing the confounding factors such as cell line-derived variation. We performed the assays under low transfection efficiency conditions and gated on mCherry-positive cells in order to limit plasmid copy numbers that could affect the ability to observe changes in GFP fluorescence intensity by flow cytometry. Since we have previously seen variations in transfection efficiency between different target site reporter plasmids when transfected under the same conditions, we incorporated two internal GFP-specific gRNAs as controls to be analyzed with each reporter. This allowed us to compare the relative activity across all of the LTR-specific gRNAs since they could not all be tested against each of the LTR reporters. We found that within the described transfection efficiency range, we saw comparable levels of relative GFP knockdown when using the two GFP control gRNAs.
Gene therapy approaches designed to cure an infected individual will need to ensure that all relevant within-host variants are targeted. Although early initiation of long-term cART has been shown to reduce the rate of HIV evolution, the virus is still thought to accumulate about 0.97 mutations/kb/year [13,14]. Using a mathematical model, we showed that variants that are not recognized and cleaved will be the major barrier to achieving functional cure thresholds. These variants, if replication-competent, have the potential to reactivate upon cART interruption and reseed the reservoir. Our model makes assumptions about the underlying distribution of quasispecies abundance, which is not fully understood. Yet, because CRISPR works on a fraction of quasispecies, our conclusions appear robust to simulated reservoirs with different absolute number of species (see Additional file 4: Figure S3). Estimating time to rebound based on reservoir reduction is challenging and various estimates of thresholds for depletion exist [15,16,[31][32][33]. In our simulations, we have included estimates for median 1 year and median lifetime remission from HIV rebound [15,16]. These thresholds were developed from natural reservoirs and might not correspond exactly to the perturbed CRISPR-treated reservoirs. Most importantly, the depletion itself depends on targeting viral quasispecies diversity. While we endeavor to estimate targeting proportions in the present work, further experiments are needed to fully understand the in vivo process.
Besides cleavage efficiency, target site conservation, and reservoir size, a number of other factors will also contribute to the clinical success of this type of gene therapy for HIV cure [28,[34][35][36]. For example, we have also not explicitly incorporated gene delivery in the current model but instead assumed that it is captured within the cleavage efficiency parameter ϵ. However, we have shown previously [37] that gene delivery of endonucleases using viral vectors is prone to large bottlenecks at the points of vector packaging, viral entry, and gene expression. Optimization of gene delivery is therefore another important step needed for the clinical success of gene therapies against HIV. We and others have shown that multiple doses will be needed to deplete the reservoir to achieve functional cure thresholds [15,16,37]. Dosing regimens will need to optimize efficacy while minimizing potential toxicity and off-target effects.
HIV has also been shown to rapidly escape endonuclease targeting in vitro [10,11,29]. Although this risk is reduced by keeping the patient on cART, it is still important for endonuclease-based therapies to target multiple sites concurrently in order to achieve sustained reservoir depletion and prevent the emergence of treatment resistance. Our simulations support these findings and show that even enzymes with high on-target efficiency will fail to produce a functional cure if there are target site variants present at frequencies as low as 1%. Two recent proof-of-principle studies showed that an approach with dual gRNAs targeting multiple genes can delay or completely prevent viral escape [12,38]. We identified paired and triplet sets of gRNA target sites that occur in over 98% of the population. Since these sites are likely to also be highly conserved within-host (as our results suggest), they would be good candidates for testing in vitro for activity. Although our mathematical model can incorporate multiplexed gRNAs by changing the coverage (ρ), it does not explicitly include dynamic emergence of treatment-resistant variants. Our model framework is amenable to emergent resistance but was not included for lack of information on these dynamics. Nor does the model include potential anatomic sanctuary sites where HIV diversity changes in time. The modeled CRISPR therapy assumes constant suppressive cART, and we rely on previous observations that potent cART prevents most ongoing evolution [13,[39][40][41][42][43].
A number of recent studies have designed LTR-based CRISPR strategies and shown broad antiviral activity against HIV in a number of different model systems [7,8,12,20,21,38,44,45]. LTR is an attractive target because there are two copies per provirus genome, and this allows a single gRNA to potentially cleave two independent regions, leading to a deletion of a majority of the provirus or mutations in one or both LTRs. Each of these potential outcomes is beneficial as they can all impact HIV replication and reactivation. However, we have shown here that pol may be a better genomic target for directed mutagenesis due to target site conservation, which allows targeting of a majority of variants with reasonable numbers of gRNAs in multiplexed designs. As a result, we believe that targeting multiple sites within pol may be a better approach than targeting LTR alone, which generally contains less conserved sites.
The weak correlation between predicted and measured activity scores is likely due to differences in the methods, cell lines, and experimental conditions used to generate the two sets of scores. The predicted activity score generated by the sgRNA designer tool is based on a broad genome-wide CRISPR-based screen that was used to train a machine learning model [17]. In spite of the differences in approaches, the fact that the scores are correlated is encouraging because it helps to further validate this broadly used metric.
One of the limitations of our within-host analysis is that we do not have detailed information about the patient cohort [24] such as treatment status, age at HIV diagnosis, and time of cART initiation and interruption, if any. These factors could potentially impact reservoir diversity. However, the current analysis is primarily aimed at demonstrating the importance and feasibility of designing gRNAs targeting a diverse viral population. Future work needs to address this in greater detail, possibly incorporating treatment-related variables to select gRNA designs.

Conclusions
In summary, we have performed a detailed computational analysis to identify optimal CRISPR target sites, taking into consideration both within-host and global viral diversity. We determined the in vitro activity of a set of gRNAs targeting highly conserved sites and showed a weak but positive correlation between measured and predicted activity. We used a mathematical model to simulate clinical application of this therapy and showed that although increased dose may overcome low target cleavage efficiency, inadequate targeting of rare strains is predicted to lead to rebound upon cART cessation even with many doses. Our results have applications beyond HIV and CRISPR since genetic diversity is an important consideration for any gene therapy platform targeting a heterogeneous population, whether it is a persistent viral disease such as hepatitis B virus, or even cancer.

HIV sequence datasets and pre-processing
For our analysis of global target site conservation, we obtained sequences from the Los Alamos National Laboratory (LANL) database. For each region of interest (gag, pol, LTR), we downloaded pre-made LANL alignments of all available group M sequences (2016 version). We extracted a majority consensus sequence using Geneious v10 [46] for all sequences in group M and for each subtype. We did not consider groups N, O, or P in our analyses because they represent a small fraction of HIV infections globally compared to group M and there are limited sequences available for these groups. However, our algorithms are easily adapted to run on any alignment provided.
For within-host analyses of target site conservation, we used deep-sequencing data (Additional file 1: Table S5) from a study of HIV-infected blood donors in Brazil [24]. Raw paired-end reads for each patient were trimmed to remove adapters and low-quality regions using Trimmomatic v0.32.2 [47] and mapped using Bowtie2 v0.2 [48] to the consensus sequence deposited by the authors to GenBank. These pre-processing steps (Additional file 3: Figure S2) were performed within the Galaxy software framework (https://galaxyproject.org/).

gRNA target site analysis
We developed a custom script to identify gRNA target sites for an input sequence given a specified PAM sequence (default 'NGG' for spCas9) and desired gRNA length w (default 20 nt). The algorithm finds all matches to the PAM sequence in the forward and reverse directions and returns, for each match, w bases upstream of the PAM sequence. We then used the sgRNA designer from the Broad Institute (https://portals.broadinstitute.org/gpp/public/analysis-tools/ sgrna-design) to determine predicted on-target efficacy score and off-target scores (threat matrix) [17]. On-target predicted activity scores are in the range [0,1] with higher values predicting more active guides and a score of 1 indicating successful knockout in the experiments in [17,23].
For each target site identified, we determined the number of exact matches found in an alignment of the region of interest (LTR, gag, or pol). We excluded all sites with close off-target matches to the human genome (> 3 matches in Match Bin I, i.e., CFD score = 1 [17]). For each region, we determined pairs and triplets of gRNAs by starting with the previously identified list of gRNAs and adding on guides that increase targeting when used in combination.
We computed target site conservation in terms of the frequency of occurrence of the target site (exact matches) within the alignment and also we used a measure of information content similar to what is used to generate sequence logo plots [49,50]. We applied a moving window of size 23 (corresponding to the width of gRNA) and computed conservation from the relative frequencies of bases in the alignment using the method of Schneider et al. [50] incorporating small-sample correction. The result is a value between 0 and 2 bits with higher values indicating greater sequence conservation. All analyses were performed in R/Bioconductor, and code is available on GitHub (http://github.com/proy chou/CRISPR).

Functional testing of gRNA activity
Starting with the list of target sites identified above in LTR, we selected gRNAs from a pool of the top 20 most conserved sites across group M overall, the top 10 most conserved sites in each subtype, and the top 20 pairs and triplets. As recommended by sgRNA designer, we excluded any gRNAs with on-target activity scores < 0.2.
We developed 4 LTR-GFP fusion reporter constructs using consensus sequences for all group M, subtype A, subtype B, and subtype C (further details in Additional file 5). Internal start codons and stop codons were identified within the sequence for each consensus LTR, and the reading frame with the fewest combined number of start codons and stop codons was identified. Reading frame 1 for group M contained 5 start and 4 stop codons, reading frame 1 for subtype A contained 3 start and 6 stop codons, reading frame 1 for subtype B contained 3 start and 6 stop codons, and reading frame 1 for subtype C contained 3 start and 5 stop codons. All the internal start and stop codons were modified for each consensus LTR sequence as follows: ATG to GTG -M to V; TGA to GGA -stop to G; TAG to GAG -stop to E; TAA to GAA -stop to E, so that one continuous open reading frame was generated. Each of the 4 modified consensus LTR sequences was then synthesized as a gBlock and cloned into a reporter plasmid vector (cloning details available upon request) as a fusion to the 5′ end of the eGFP ORF so that the MND promoter drove expression of a single continuous ORF (see Additional file 2: Figure  S1A for amino acid sequences). The majority of the 59 gRNA target sites identified for analysis within the group M, subtype A, subtype B, and subtype C consensus LTRs were not changed by start or stop codon modification, with the exception of overlapping gRNA targets 1 and 2, and overlapping gRNA targets 18 and 19. A separate reporter construct was generated for gRNAs 1, 2, 18, and 19 by fusing their target sequences to the 5′ end of the eGFP ORF so that the MND promoter also drove expression of a single continuous ORF (cloning details available upon request).
Of the 59 LTR-specific gRNA target sites we elected to screen for activity, 23 were present in the group M reporter, 27 were present in the group A reporter, 20 were present in the group B reporter, 18 were present in the group C reporter, and gRNAs 1, 2, 18, and 19 were not present in any LTR reporter. Three of the gRNA targets were present in all 4 LTR-reporter constructs, 8 were present in 3 LTR-reporter constructs, and 8 were present in 2 LTR-reporter constructs. To screen the activity of individual LTR-specific gRNAs, they were cloned into the BbsI site of the plasmid pU6-(Bbs1) CBh-Cas9 -T2A-mCherry (a gift from Ralf Kuehn; Addgene plasmid no. 64324) under the control of the U6 promoter. This plasmid expresses spCas9 and mCherry from the constitutive CBh promoter. Internal positive controls for GFP knockdown were used by also cloning gRNAs eGFP1 and eGFP2 targeting the sequences CAAC TACAAGACCCGCGCCG and GTGAACCGCATCGA GCTGAA into pU6-(Bbs1) CBh-Cas9-T2A-mCherry. To assay gRNA activity 2 × 10 5 , 293 cells were plated in 12-well plates and the following day individual wells were transfected by PEI transfection with 1000 ng of a Cas9/LTR-gRNA expressing plasmid and 250 ng of its corresponding LTR-reporter plasmid. At 24 h post-transfection, flow cytometry was performed and GFP fluorescence was analyzed in Cas9 expressing (mCherry positive) 293 cells to determine the level of GFP knockdown provided by each gRNA.

Intra-host target site conservation
Focusing on the pol gene, we identified spCas9 gRNA target sites within the HIV consensus sequence for each patient using the script described above, excluding any sites containing degenerate bases. We also determined which of the target sites we had previously identified from groupand subtype-level consensus sequences for pol were present in the patient consensus sequence. Using the average number of reads overlapping all identified target sites, we excluded any patients with < 5000× target site depth since we were interested in variants that may escape targeting by candidate gRNAs. For each target site, we determined the number of reads in the alignment containing an exact match to the target site and excluded any sites where coverage was less than 5000×. We then used the total number of reads that completely overlap the target site to calculate the percentage of exact target site matches.

Statistical analysis of within-host conservation
To test whether there were differences in target site conservation measured by mean percentages of exact target site matches per total reads, a linear mixed model was fit with percentage as the outcome and the consensus sequence group (group M, subtypes A-C, and patient) as the predictors. A random intercept for each subject by consensus group was used to account for within subject and group variation across the repeated outcomes. An overall test was performed from ANOVA for mixed models using the lmerTest package in R [52]. Post-hoc pairwise tests were also performed comparing the patient-derived sequences, group M, and subtype B (the circulating strain in the patient population). To compare the conservation using patient target sites to the consensus groups, we pooled group M and subtypes A-C into a single group for comparison in the model, while the random effects specification remained the same. P values corrected for multiple testing were also reported using the Holm method [53]. Code and data are available at http://github.com/proychou/CRISPR.

Mathematical model of reservoir depletion with simultaneous suppressive cART and CRISPR therapy
We have used a mathematical model to describe natural clearance of the HIV reservoir on consistent cART previously [27]. That model assumed an HIV reservoir that exponentially cleared with previously measured rates.
Here, we extended that model to consider simultaneous treatment with suppressive cART and CRISPR gene therapy. The reservoir is now conceived of as a population of different strains, and each strain is associated with some number of infected cells. cART is assumed to prevent ongoing replication, viral evolution, and/or increases of diversity. Additional CRISPR therapy targets some fraction of these strains, and depending on the coverage, or "proportion" (ρ), and the enzyme activity to those covered strains, or "efficacy" (ϵ), the reservoir is reduced accordingly with each successive CRISPR dose. Throughout the simulations, we use weekly doses τ = 7 days, but this choice is arbitrary and adjustable.
The natural clearance of the reservoir on suppressive cART was modeled as follows. For each strain, a clearance rate was randomly sampled so that the clearance of the entire reservoir agrees with previously measured population level statistics [25,26] such that the half-life of latently infected cells is normally distributed with mean and standard deviation of 3.6 and 1.5 years, respectively, or t f1=2g N ð3:6; 1:5Þ. Of note, this half-life represents the natural clearance rate of the replication-competent reservoir as measured by viral outgrowth assays [25,26]. In contrast, the half-life of HIV DNA is longer [54,55]. We call the strain-specific clearance rate θ s (per day). Each strain (indexed by s) is initialized with a number of infected cells L s (0) drawn from a log-normal distribution with average value μ s and standard deviation σ s = μ s so that each strain has size logL s ð0Þ N ðμ s ; σ s Þ . Then, we denote the total number of strains S and the total initial reservoir size L(0) that P S s¼1 L s ð0Þ ¼ L ð0Þ. The total number of strains is constrained by the initial reservoir size as S ≈ Lð0Þ=μ s . We can write model for a single strain without CRISPR therapy using an ordinary differential equation (ODE) model as _ L s ¼ −θ s L s , where the over-dot denotes derivative in time. Such an equation is solved simply, L s (t) = L s (0) exp(−θ s t), and applies for strains not in the covered CRISPR set, (s ∉ ∁), where ∁ ¼ f1; 2; 3; …jρSjg and |·| denotes rounding to the nearest integer. For strains in the CRISPR set, the dynamics are governed by the additional reduction in reservoir due to CRISPR, η(t, τ), such that the CRISPR instantaneously removes a fraction of the reservoir ϵL s (t) after each dosing time τ. We solve these equations accordingly for strains in and not in the covered set and sum to find the total reservoir size L(t) = ∑ s L s (t). Stochastic simulations and deterministic simulations result in similar results (data not shown). All code is freely available at http://github.com/proychou/CRISPR. Parameters relating to CRISPR (ϵ, ρ, μ s ) are varied throughout simulations. The reservoir initial size was held constant throughout simulations at~1 million cells [25,26,56]. The clearance rate of each strain was sampled from a normal distribution with mean half-life 3.6 years and standard deviation 1.5 years as has been measured previously [26]. In the stochastic simulation, strains do sometimes increase over time on cART, a realistic phenomenon. However, simulations were also performed with clearance rates of zero to similar results. Indeed, based on the timeframe of the present analyses (less than a year of cART), natural clearance has a minimal impact compared to CRISPR intervention.