Single cell resolution landscape of equine peripheral blood mononuclear cells reveals diverse immune cell subtypes including T-bet+ B cells

Traditional laboratory model organisms represent a small fraction of the diversity of multicellular life, and findings in any given experimental model often do not translate to other species. Immunology research in non-traditional model organisms can be advantageous or even necessary (e.g. for host-pathogen interaction studies), but presents multiple challenges, many stemming from an incomplete understanding of potentially species-specific immune cell types, frequencies and phenotypes. Identifying and characterizing immune cells in such organisms is frequently limited by the availability of species-reactive immunophenotyping reagents for flow cytometry, and insufficient prior knowledge of cell type-defining markers. Here, we demonstrate the utility of single cell RNA sequencing (scRNA-Seq) to characterize immune cells for which traditional experimental tools are limited. Specifically, we used scRNA-Seq to comprehensively define the cellular diversity of equine peripheral blood mononuclear cells (PBMCs) from healthy horses across different breeds, ages, and sexes. We identified 30 cell type clusters partitioned into five major populations: Monocytes/Dendritic Cells, B cells, CD3+PRF1+ lymphocytes, CD3+PRF1- lymphocytes, and Basophils. Comparative analyses revealed many cell populations analogous to human PBMC, including transcriptionally heterogeneous monocytes and distinct dendritic cell subsets (cDC1, cDC2, plasmacytoid DC). Unexpectedly, we found that a majority of the equine peripheral B cell compartment is comprised of T-bet+ B cells; an immune cell subpopulation typically associated with chronic infection and inflammation in human and mouse. Taken together, our results demonstrate the potential of scRNA-Seq for cellular analyses in non-traditional model organisms, and form the basis for an immune cell atlas of horse peripheral blood.


INTRODUCTION
The utility of studying non-traditional model organisms for biology, and particularly immunology, research is increasingly recognized. Experimental biology has been dominated by work in a relatively short list of well-studied model organisms such as the fruit fly and the mouse.
Work in these species has many advantages including short generation times, tractability for genetic manipulation, high quality well-annotated genomes, and a large and diverse toolkit of wellvalidated reagents and protocols for experimentation (1). Despite the utility of these organisms in uncovering fundamental biological principles, they are not without their limitations. For example, they represent a small fraction of the diversity of multicellular life, and findings in any given experimental model do not always translate between species, as has been particularly well described for mouse-human translational studies (2). Many biological phenomena relevant to human health and society cross species boundaries, such as the circulation of established and/or emerging zoonotic pathogens in animal reservoirs (3) as well as animal health and its contribution to agriculture and the global food supply. As such, a holistic One Health approach to biology and immunology is essential and has been increasingly recognized by public health associations including the World Health Organization (4)(5)(6).
While the importance of studying diverse species is recognized, doing so can prove challenging due to a dearth of experimental tools which are available for more commonly investigated laboratory organisms. In immunology, flow cytometry is the traditional "gold standard" technique for defining cell subpopulations (7,8). As flow cytometry and related mass cytometry technologies advance, with increasing capacity of parameters measured in a single panel, these methods can define immune cell subtypes at ever higher resolution (9). However, irrespective of resolution, flow cytometry ultimately relies on the availability of highly specific, well-defined antibodies against relevant surface markers (7). Such reagents are readily available from commercial sources for widely studied species such as mouse and human, but availability for other species can be limited. Raising, validating and conjugating custom antibodies to every target of interest is laborious, expensive, and not practical for most research laboratories. Furthermore, it requires a priori knowledge of cell type-defining surface markers.
Single cell RNA-Sequencing (scRNA-Seq) offers an alternative to flow cytometry as it defines different cell types (and their functional states) by gene expression patterns rather than surface marker expression. Recent advances in scRNA-Seq technology have enabled increased throughput and decreased cost per cell, allowing researchers to process thousands to tens-ofthousands of cells in a single experiment using droplet based microfluidics (10)(11)(12). scRNA-Seq offers many potential advantages for characterizing cell types in non-traditional model organisms including i) it can be applied to cells of diverse species without specialized reagents, ii) it does not rely on a priori marker selection or reagent availability, and iii) it can be used to identify novel markers for focused studies (13).
In this study, we demonstrate the potential of scRNA-Seq for discerning and discovering cell types in a non-traditional model organism, the horse. Equids are agriculturally and economically important globally, and host multiple zoonotic diseases including eastern equine encephalitis virus, Hendra virus, methicillin resistant Staphylococcus aureus (MRSA), Salmonella spp., Leptospira spp., and Streptococcus equi zooepidemicus (14). Study of these natural infections in horses could be important from a One Health perspective in multiple aspects: i) to develop tools to prevent infection of horses with zoonotic diseases, ii) to break the chain of horseto-human transmission, iii) to understand immunologic determinants of protection, clearance, and disease that could translate to improved understanding of human correlates, and iv) to improve the health of an economically important species.
Recent work from our laboratory and others established the current state-of-the-art flow cytometry protocols for immunophenotyping equine PBMC (15). However, compared to mouse and human, many immune cell subtypes remain to be defined at high resolution in the horse.
Here, we used scRNA-Seq to characterize equine PBMC at unprecedented cellular resolution and present an immune cell atlas for horse peripheral blood. We identified 30 cell type clusters comprising major CD3 + lymphocyte, B cell, Monocyte/Dendritic Cell (DC) and Basophil cell populations. Clusters were annotated based on gene expression signatures, revealing several immune cell subtypes not previously described in horses. Interspecies comparisons with human PBMC scRNA-Seq datasets uncovered conserved blood DC subpopulations, and identified a spectrum of monocyte cell states similar to humans. Remarkably, we found that a large portion of the horse peripheral B cell compartment is comprised of T-bet + B cells. Cellular analogs of this population in human and mouse are associated with chronic infections (16,17).

Single cell RNA-Seq of equine PBMC resolves a diversity of immune cell types
We performed scRNA-Seq on fresh PBMC collected from 7 healthy adult horses. To generate broadly representative datasets, included horses spanned different breeds (Warmblood, Thoroughbred, Quarter Horse), ages (6-10 years), and both sexes (Table 1). In preliminary quality assessments of scRNA-Seq data processed with standard software workflows (10X Genomics Cell Ranger pipeline, EquCab3.0 reference genome with Ensembl v95 transcript annotations), we observed unexpectedly low numbers of genes detected per cell (Fig. S1A). Upon inspection of read mapping patterns for select genes known to be highly expressed in equine PBMC but found to be absent from our datasets, we frequently observed reads mapped immediately downstream of annotated transcript regions (Fig. S1B). This pattern is consistent with effective sequencing of transcript 3' regions (as expected with 10X Chromium 3' scRNA-Seq) but omission from quantification due to transcriptome annotations. In particular, incomplete annotation of transcript 3' untranslated regions (UTRs, the most frequent transcript region captured by 10X Chromium 3' scRNA-Seq) (18), is common in non-traditional model organisms relative to mouse or human reference transcriptomes (19). We therefore implemented an optimized data processing workflow that included the End Sequence Analysis Toolkit (ESAT, for "rescuing" reads mapped to unannotated 3' transcript regions) (20), along with additional modifications (manually annotated immunoglobulin genes, quantification strategy for genes with multiple annotations, details in Materials and Methods) for this scRNA-Seq dataset. This approach significantly increased the number of genes detected per cell (Fig. S1A), and resulting output was used for all downstream analyses.
Unsupervised graph-based clustering of 36,477 cells (post-quality filtering) resolved 31 clusters (Fig. 1A). Based on PCA hierarchical clustering and marker gene expression patterns ( Fig. 1B, 1C), we grouped all clusters into 5 "major cell groups": CD3 + PRF1lymphocytes, CD3 + PRF1 + lymphocytes, B cells, Monocytes/Dendritic cells (DCs), and Basophils. Complete marker gene lists for major cell groups are presented in Dataset S1. All major cell groups were represented at similar proportions across all 7 horses (Fig. 1D). To characterize equine PBMC at high resolution and establish a corresponding peripheral blood immune cell atlas, we independently analyzed scRNA-Seq data for the constituent clusters of each major cell group, except Basophils due to the low number of cells.
Presumptive DC clusters (24,19,26) were also analyzed by differential gene expression analysis (Dataset S4). Differentially expressed genes in cluster 24 included CLEC9A, CADM1, and BTLA (Fig. 2F, Dataset S4), all of which are immunophenotyping markers for cDC1 in humans and mice (27) (in mice, CLEC9A is also expressed on plasmacytoid DC (28)). Genes with significantly enriched expression in cluster 19 included FCER1A and SIRPA (Fig. 2F, Dataset S4), which are flow cytometric markers of cDC2 in humans and mice (Reviewed in (27)). DC subsets are also defined by the transcription factors that regulate their development and function, particularly by relative levels of IRF4 and IRF8 (27). Although IRF4 transcripts were sparsely detected across all DC clusters (likely due to the incomplete sampling depth characteristic of droplet scRNA-Seq), IRF8 was expressed at high levels in cluster 24 (cDC1) and significantly lower levels in cluster 19 (cDC2). Cluster 24 also exhibited high expression of BATF3, another characteristic transcription factor of cDC1 (29). In addition, top ranked differentially expressed genes in cluster 26 included IRF7 and TCF4 (E2-2) (Fig. 2F, Dataset S4), both of which are fundamental to plasmacytoid DC (pDC) development and function (30,31).
To further support our cell type annotations and assess potential differences in monocyte/DC subsets between horses and humans, we performed cross-species hierarchical clustering with a human PBMC public reference scRNA-Seq data set ( Fig. S2A-B, Fig. 2G).
Equine clusters annotated as classical monocytes clustered first with each other, and next with human classical monocytes (defined by scRNA-Seq gene expression and confirmed with corresponding CD14/CD16 immunophenotyping feature barcoding data). This pattern likely reflects the heterogeneous transcriptional states of classical monocytes defined by CD14 expression, and suggests broad similarities of this cell type across species. Equine non-classical monocytes clustered with human intermediate and non-classical monocytes. Remarkably, each DC subgroup clustered by cell type rather than species, indicating strong similarities of gene expression patterns between horse and human. These results further support three distinct DC subpopulations in horse peripheral blood that correspond with cDC1 (cluster 24), cDC2 (cluster 19), and pDC (cluster 26) in humans.

The equine peripheral B cell compartment includes a large proportion of T-bet + B cells
We next performed an in-depth analysis of B cell clusters, as defined by their expression of MS4A1 (CD20), CD79A, MHC-II components (i.e. DRA), and/or immunoglobulin transcripts

Equine T-bet + B cells share gene expression features with human T-bet + B cells and can be identified in equine PBMC by flow cytometry
In humans, T-bet + B cells have been described as "atypical memory B cells," appearing in the peripheral blood during chronic infection and/or inflammation (17,34). Although specific markers and/or gene expression patterns vary in different datasets, these B cells are often found to express ITGAM (CD11b), ITGAX (CD11c), as well as genes that modulate BCR signaling (including FCRL4, FGR, and HCK) (35)(36)(37)(38)(39). Moreover, a recent study of T-bet + B cell populations in the context of chronic HIV infection demonstrated expression of genes associated with germinal center B cells (16,40). We assessed expression of several of these characteristic genes in B cell clusters, and observed patterns consistent with multiple reports in humans (Fig. 4A). Among B cells, ITGAM (CD11b) expression was restricted to clusters 0 and 22, while FCER2 (CD23) was virtually absent from these clusters. Although sampling for ITGAX (CD11c) and ENSECAG00000031055 (annotated as CR2/CD21) was insufficient for differential expression testing, we detected ITGAX (CD11c) positive cells in T-bet + clusters 0 and 22 (Fig. 4A). Moreover, we detected significantly elevated expression of FCLR4, FGR, and HCK in these clusters (Fig.   4A, Dataset S5).
Based on these scRNA-Seq expression patterns, we developed a flow cytometry panel to identify equine T-bet + B cells by protein expression. Since T-bet exhibits 92% amino acid sequence identity between horses and humans (GenPept accessions XP_023508425.1 and NP_037483.1, respectively), we selected an anti-human T-bet antibody for intracellular labeling.
We then adapted a previously validated equine PBMC immunophenotyping panel (15) to identify CD3 -CD14 -PanIg + B cells (Fig. S4). As predicted from our scRNA-Seq data, we detected an abundant CD11b + B cell population with high expression of T-bet ( Fig. 4B) that did not express surface CD21 or CD23 (Fig. 4C). T-bet expression was not detected in other B cell gates. We also assessed surface isotype usage of T-bet + B cells by flow cytometry; 51 ± 18% T-bet + B cells were IgM hi , 23 ± 12% were IgG1 + , and 24 ± 8% expressed neither IgG1 nor IgM (Fig. 4D, E). It is unclear whether IgM hi T-bet + B cells reflect an antigen-inexperienced naïve subset, a recently activated subset, or a memory cell subset that did not undergo class switch recombination. By flow cytometry, T-bet + B cells comprised 44 ± 17% of total B cells, a percentage that was correlated, but consistently lower, to the percentage observed in scRNA-Seq data (Fig. 4F), perhaps reflecting incomplete sensitivity of PanIg antibody labeling for all B cells.
Taken together, our flow cytometry results validate the existence of a novel population of T-bet + B cells initially identified by scRNA-Seq analysis, which also demonstrated similarities with human T-bet + B cells associated with chronic infection and inflammation.
All clusters were characterized by expression of the cytotoxic effector PRF1 and CTSW, a cathepsin whose expression is associated with cytotoxic capacity (41) (Fig. 1B). Based on hierarchical clustering of integrated PCA data, we partitioned our annotations into three distinct transcriptional programs (Fig. 5C). Although all clusters expressed high levels of CD3 transcripts (CD3D, CD3E, CD3G, Fig. S5A), based on differential gene expression results (Dataset S7), this major cell group likely includes both cytotoxic T cells and NK cells.
Transcriptional profiling studies of human and mouse cells often describe challenges in distinguishing cytotoxic lymphocyte subpopulations, with memory αβ CD8 + T cells, NK cells, NKT cells, and γδ T cells exhibiting considerable overlap in gene expression patterns (42)(43)(44)(45)(46). Our data suggest similar overlap exists among equine cytotoxic lymphocyte subpopulations. We annotated clusters 5 and 14 as CD8 + "antigen experienced" or "non-naïve" T cells. While overall quite similar, cluster 5 exhibited features more consistent with CD8 + T central memory cells in humans (GZMK/GZMM protein expression, absence of GZMA protein), while cluster 14 exhibited features more consistent with CD8 + T effector memory cells (GZMK/GZMM/GZMA protein expression) (47). We emphasize that these expression patterns are not fully distinct and are unlikely to correspond perfectly to subpopulations defined by traditional flow cytometric markers.
Furthermore, although these cells appear to share common cytotoxic gene expression programs, we observed notable within-cluster heterogeneity. Indeed, cluster 5 contained mutually exclusive subgroups of cells which expressed either CD8A or CD4 (Fig. S5B).
Although expressing many of the same cytotoxic effector genes, cluster 20 appeared distinct from other cytotoxic lymphocyte clusters (Fig. 5A, 5C). Differential gene expression analysis revealed highly significant elevated expression of TRDC. Cells in cluster 20 also expressed lower levels of TRAC, TRBC1, and TRBC2 relative to other cytotoxic lymphocyte clusters (Fig. S5C). Based on these expression patterns, we annotated cluster 20 as cytotoxic γδ T cells. Interestingly, this cluster demonstrated high and specific (or nearly specific) expression of several genes associated with cytotoxicity, including GNLY, KLRB1, and KLRF1. These results support the existence of equine γδ T cells, which have not been definitively characterized.
Moreover, they suggest that these cells employ unique cytolytic mechanisms compared to other equine cytotoxic lymphocytes. ZNF683 has also been described in human cytotoxic T cell subsets (50). We annotated cluster 17 as NK cells based on specific expression of ENSECAG00000031528 (annotated as KLRD1/CD94), which encodes the cell surface lectin central to NKG2 functions (Fig. 5E). This cluster also exhibited specific expression (within the cytotoxic lymphocyte major cell group) of FCER1G and CD247, both of which are important for NK cell activation signal transduction (51).
In addition, and in contrast to cluster 4, this putative NK cell cluster exhibited diminished or absent expression of CD2 and CD5, genes frequently used as T cell markers in humans (52) (Fig. 5E).
Of note, multiple descriptions of equine NK cells by flow cytometry or immunohistochemistry have purposefully excluded CD3 + cells (53)(54)(55). However, consistent with scRNA-Seq, our flow cytometric analysis identified a well-defined CD3 + CD16 + lymphocyte population (Fig. S6A). Given their expression of TCR transcripts, it remains unclear whether these cells have the capacity to respond to specific antigen presented by traditional MHC-I or MHC-II.
Although cluster 4 has gene expression patterns consistent with both cytotoxic T cells and NK cells, the absence of definitive marker genes and/or genes associated with NK cell-restricted functions made it challenging to annotate this cluster. Based on the overlapping gene expression programs described in cytotoxic lymphocytes in better characterized species, we suspect this cluster could include an additional type of CD8 + cytotoxic T cells, semi-invariant TCR cytotoxic T cells (e.g. Mucosal-associated invariant T cells, NKT cells), and/or an additional type of NK cell.
The latter possibilities are further supported by cross-species comparison to human cytotoxic lymphocytes (Fig. 5F). Alternatively, cluster 4 may represent a novel type of cytotoxic lymphocyte unique to horses.

CD3 + PRF1clusters include naïve T cells and heterogeneous CD4 + T cell populations
The CD3 + PRF1 -T cell major cell group is composed of 11 clusters, including proliferating T cells (cluster 31, Fig. 6A), which were represented at similar frequencies across all horses examined (Fig. 6B). Although the most abundant in both cell and cluster numbers, these subpopulations were also the most challenging to effectively annotate, due in large part to the relatively subtle transcriptional differences detected between most clusters. In our experience, resting T cell populations can be difficult to distinguish by droplet microfluidics scRNA-Seq data.
Despite these limitations, we were able to make several observations regarding the non-cytotoxic T cell constituent clusters. First, we distinguished naïve T cells (clusters 2, 7, 8) based on elevated expression of CCR7, SELL (L-selectin), and the LEF1 transcription factor (Fig. 6C). Naïve T cells could be further partitioned into CD4 + (Clusters 2, 8, not significant by differential gene expression) and CD8 + (Cluster 7) subpopulations (Fig. 6C). Although the remaining "non-naïve" clusters (presumably antigen experienced) exhibited significant gene expression differences, we were not able to confidently assign clusters to T cell subsets traditionally defined by flow cytometry (e.g. memory Th1, memory Th2, memory Th17, etc.).

High resolution landscape of equine peripheral blood mononuclear cells
Given the improved resolution and novel cell populations identified by scRNA-Seq, we grouped annotated cell clusters into summary populations and provided "reference ranges" for their frequency in healthy horses (Fig. 7A). We also evaluated how the frequencies of major cell groups defined by scRNA-Seq compared to those that can be resolved by current state-of-the-art flow cytometry equine PBMC immunophenotyping (adapted from (15) Taken together, these results demonstrate that our scRNA-Seq cell cluster annotations are consistent with state-of-the-art flow cytometry methods, but can resolve cell populations at much higher resolution and sensitivity.

DISCUSSION
Here, we used scRNA-Seq to define the cellular landscape of equine peripheral blood immune cells. This study demonstrates proof-of-concept for the utility of high throughput scRNA-Seq as a tool to characterize distinct cell types in species for which traditional flow cytometric tools are limited. Furthermore, interspecies analyses demonstrate the potential of scRNA-Seq for comparative studies exploring the evolution of cellular specialization of the immune system.
Prior to this work, state of the art equine PBMC immunophenotyping methods (15) (22)), have been described as generally conserved across mammalian species (56), and CD16 + monocytes have been previously observed in horses (53). Indeed, our results support the existence of similar monocyte subsets in horse and human peripheral blood. Moreover, the gene expression data supplied by scRNA-Seq offer insight into the putative functions of these cellular subsets. These include distinct trafficking programs (based on adhesion molecules and chemokine receptors) and antigen presentation capacity (expression of MHC-II genes is generally anti-correlated with CD14, with highest expression in the classical mono3 cluster, followed by non-classical mono). These data also include potential novel surface markers (e.g. CD8A for nonclassical monocytes) for improved immunophenotyping by flow cytometry, though RNA transcript levels may not necessarily correspond with surface protein expression. We also identified three dendritic cell clusters, with gene expression consistent with the cDC1, cDC2, and pDC subsets described in humans (27)  ). Our data also indicate that, although they share many features with corresponding DC subsets in other species, equine DC subsets exhibit notable differences. For example, we did not detect ITGAM (CD11b), a marker for murine cDC1 (57), at appreciable levels in equine cDC1.
Informed by the gene expression programs defined here, additional experimental characterizations are necessary to definitively assign functions to these different subsets, each of which is likely to play a critical role in equine immunity.
In addition to identifying and/or validating cell types that we anticipated would be present in equine peripheral blood, the unsupervised clustering approach to scRNA-Seq data also revealed previously undescribed and unexpected cell populations. Within the B cell compartment, we expected to detect clusters consistent with naïve B cells, memory B cells, and antibody secreting cells, based in part on human PBMC scRNA-Seq data and equine PBMC flow cytometry data (15). In addition to these clusters, we were surprised to observe two additional clusters of apparent B cells (as defined by high expression of MS4A1/CD20 and immunoglobulin transcripts) with gene expression patterns notably distinct from presumptive naïve and memory clusters. Cells in these clusters, characterized by specific expression (within the B cell major cell group) of the T-bet transcription factor (TBX21), were the most abundant B cells across all seven horses under investigation. Corresponding populations of T-bet + B cells were not observed in human PBMC scRNA-Seq data, and have not been previously described in horses. In mice, T-bet + B cells have been shown to be important for antiviral humoral immunity (34,62,63). In humans, T-bet + B cells have been detected in peripheral blood in a variety of chronic inflammatory contexts including systemic lupus erythematosus (64,65), chronic malaria exposure (66,67), and chronic viral infection (16,40,68). Although a universal definition and designated function for these cells remains elusive, T-bet + B cells are often classified as atypical memory B cells and, at least in some contexts, are thought to arise from repetitive BCR stimulation (17). Indeed, human T-bet + (16). The equine T-bet + B cell populations identified in the present study share many features with the atypical memory B cell populations described in humans. In addition to TBX21/T-bet expression, horse T-bet + B cell populations exhibit similar gene expression patterns, including enriched expression of FCRL4, FGR, and HCK, the protein products of which modulate BCR signaling (38). Furthermore, a recent study of T-bet + B cells in HIV demonstrated that they express genes characteristic of germinal center B cells (16). Interestingly, we detected specific expression of AICDA (encoding activationinduced cytidine deaminase) and elevated expression of APEX1 in equine T-bet + B cells, suggesting that the T-bet + B cells identified here may represent equine equivalents of the T-bet + atypical memory B cells described in humans. If these cells are elicited by chronic antigenic stimulation, it is plausible that horses chronically exposed to numerous pathogens common in standard boarding conditions (e.g. equine alpha and gamma herpesviruses, influenza, rhinitis viruses, hepacivirus, parvovirus-hepatitis, coronavirus, etc.) could expand this population. While viral exposure burdens are likely to be largely similar to burdens experienced by humans, horses in the northeastern U.S. are also frequently exposed to Borrelia burgdorferi (agent of Lyme disease) (69) and Sarcocystis neurona (agent of Equine protozoal myeloencephalitis) (70), and are continuously infested with or re-exposed to gastrointestinal nematodes (71). The horses in this study did not show signs of active infection or inflammation, as they all had normal complete blood count, serum amyloid A, iron indices, and globulins. Moreover, the surprisingly high frequency of these T-bet + cells within the B cell compartment suggests that they may provide scRNA-Seq is molecularly compatible with presumably any animal species as most droplet microfluidics scRNA-Seq platforms select mRNAs for barcoding and downstream sequencing based polyA tails, a feature common across metazoans. Additional requirements for scRNA-Seq analysis include a genome (or at minimum, transcriptome) sequence to which reads are mapped, and gene/transcript annotations against which mapped reads can be quantified.

B cells are enriched among virus-specific B cells in HIV infection
Should transcriptome annotations be insufficient for robust scRNA-Seq analysis, as may be the case for less commonly studied organisms, read assignment/quantification strategies can be modified with specialized software tools (e.g. ESAT (20), as implemented here) and/or annotations can be supplemented/replaced with custom annotations derived from bulk RNA-Seq data. Interpretation of scRNA-Seq results can be greatly facilitated by gene/transcript annotations with comprehensive orthologue annotations for multiple species, but this is not a requirement.
Without the need for species-specific reagents, and a constantly growing catalog of species with sequenced and annotated genomes, we anticipate that scRNA-Seq will be an increasingly useful research tool for non-traditional model organisms.
Despite the many insights gained from our PBMC analyses, scRNA-Seq, particularly for characterizing cell mixtures from diverse animals, is not without limitations. In the present study, although defining subpopulations with unsupervised clustering methods was reasonably straightforward, assigning putative cell types to each cluster presented challenges. Ideally, automated cell type classification based on external datasets and/or prior knowledge could minimize biases introduced by supervised annotation (72,73). Recently developed scRNA-Seq data integration and cluster annotation tools have begun to implement this functionality (74)(75)(76).
We made attempts to apply several of these strategies in comparing equine PBMC to human PBMC, but observed generally poor performance, which we attributed to insufficient interspecies orthologue annotations (data not shown). Instead, we adopted a supervised approach based on prior knowledge of human and mouse immune cells to assign likely cell types. We, therefore, emphasize that our presumptive cell type annotations are not definitive, and ultimately require experimental validation by complementary methods, as we pursued with flow cytometry for T-bet + B cells (Fig. 4). Furthermore, for many clusters, most notably in the CD3 + PRF1lymphocytes major cell group, we were unable to confidently assign cell types due to limited detection of informative differentially expressed genes. This could be a result of suboptimal clustering (i.e. heterogeneous clusters), relatively low transcript sampling depth, and/or discrepancies in mRNA and corresponding protein expression by which T cell subsets have been previously defined.
Many of these issues are likely to be mitigated in the future by perennially improving genome and orthologue annotations, scRNA-Seq methodologies with increased per cell sampling depth, and novel software tools for intra-and inter-species data analyses.

Research subjects and cells
Horses studied here consisted of 3 mares and 4 geldings, 6 to 10 (mean 7.9) years old, 3 Warmbloods, 3 Thoroughbreds, and one Quarter Horse. Horses were determined to be healthy by physical examination, serum biochemistry (including globulins and iron indices), complete Approximately 50 mL of blood was collected from each horse by standard jugular venipuncture. Immediately following collection, PBMC were isolated by Ficoll gradient centrifugation, as previously described (15). Residual erythrocytes were removed by ammonium chloride lysis.
All studies were conducted under approval of Cornell University Institutional Animal Care and Use Committee.

Single cell RNA-Seq
Within one hour of isolation, fresh PBMC were processed for single cell RNA-Seq on the

scRNA-Seq Data Processing
scRNA-Seq data will be made available in the GEO repository, accession number pending.

Reference genome and transcript annotations
The EquCab3.0 reference genome (77) was used in all analyses. Reference transcript annotations (Ensembl v95) were supplemented by manual annotation of the immunoglobulin heavy chain and light chain loci as described by Wagner, et al (Supplemental Dataset S6, (32)).

Read mapping and quantification
Reads were assigned to cell barcodes, mapped and quantified per gene using the CellRanger workflow (v 3.0.1, 10X Genomics) with default parameters ("standard workflow"). In our optimized workflow, BAM files generated by CellRanger were reformatted (appending cellular barcode and UMI sequence to alignment read names) and were input to the End Sequence Analysis Toolkit (ESAT, (20)). Briefly, ESAT evaluates reads mapped immediately downstream of annotated genes for potential quantification with the adjacent gene, an approach particularly relevant to 3' scRNA-Seq data with reference transcriptomes with incomplete 3'UTR annotations.
To eliminate ambiguous read assignments due to "overlapping genes" (i.e. exons from two different genes on + and -strands sharing the same genomic coordinates), the immunoglobulin-supplemented reference transcriptome (Ensembl v95) was additionally modified to remove overlapping exon intervals on opposite strands. Reformatted CellRanger BAM files were processed through ESAT in two rounds. First, ESAT was run (2500 nt extension window) with the modified transcriptome reference and set to ignore any duplicated genes. Next, to recover quantification of genes duplicated in the Ensembl v95 reference (n = 185 duplicated genes), ESAT was run (2500 nt extension window) a second time with a filtered reference containing only duplicated genes; resulting read counts were divided across gene duplications and appended to the initial gene x cell count matrix.

Doublet removal
Putative "doublet" cell barcodes were identified and removed from downstream analyses with the DoubletDetection tool (78).

Filtering, normalization, and data integration
Data were filtered to exclude genes detected in less than 3 cells (per subject), to exclude cells with less than 750 UMIs, and to exclude cells with greater than 5% UMIs assigned to mitochondrial genes (e.g. dead or dying cells). Gene-cell count matrices were independently normalized with SCTransform (79), and the top 5000 most variable genes (variance-stabilizing transformation) were selected for each subject. To minimize subject-and/or batch specific effects, datasets from all subjects were integrated on the first 40 canonical correlation components identified on the union of highly variable genes identified per subject. Immunoglobulin heavy chain and light chain genes were excluded from integration and clustering analysis.

Differential gene expression analysis
Differential gene expression analyses were conducted using edgeR v3. 26.8 (82, 83), with additional modifications for scRNA-Seq data (84). Gene expression linear models included factors for cellular gene detection rate, subject, and cluster (as identified in Seurat analysis above).
Specific contrasts are detailed in relevant Results sections and/or figures. For analyses other than comparisons among CD3 + PRF1 + and CD3 + PRF1cell clusters, differential gene expression was defined as adjusted p-value < 0.05 (Benjamini-Hochberg correction) and moderated log2 foldchange > 1 (as determined in edgeR model). Differential gene expression for CD3 + PRF1 + and CD3 + PRF1cell comparisons used a less stringent fold-change cutoff (moderated log2 foldchange > 0.58) to account for reduced dynamic range of gene expression observed in these clusters. For all analyses, genes expressed (i.e. greater than or equal to 1 UMI) in less than 25% of cells for at least one group within a contrast were excluded from differential expression results.
Resulting differential gene expression lists were further annotated for putative surface protein expression by intersecting one-to-one gene orthologs with the Human Surface Protein Atlas (85).

Horse-human PBMC scRNA-Seq cross-species correlation analysis
Cross-species scRNA-Seq correlation analyses were conducted using an approach based on Zilionis et al. (86). Human and horse gene-cell count matrices were filtered to keep only those genes with high confidence 1-to-1 orthologues across species (as defined by Ensembl v95). For each species and each major cell group (monocyte/dendritic cells, B cells, CD3 + PRF1 + lymphocytes, CD3 + PRF1lymphocytes), following normalization with SCTransform (79) genes were ranked by pearson residual, and genes above the 1.5*inflection point were selected as highly variable genes. Lists of highly variable genes in human and horse datasets were intersected, and the resulting list of orthologs present in both species was used for clustering analysis. Clustering was performed on natural log normalized gene x cell count matrices and clustered on Pearson correlation distance by Ward's method (87). Results were visualized by dendrogram with the dend function in R.

Immunophenotyping of equine PBMC by flow cytometry
The flow cytometric phenotyping protocol was adapted from (15). A list of primary antibodies is included in Table S1.    18,11,12,28) and DCs (19,24,26). (D) Heatmap of genes differentially expressed (adjusted p-value < 0.05, log2 fold-change > 1 for each cluster versus all other clusters) by each cluster, with select genes labeled at left. (E) Dot plot of select genes differentially expressed across monocyte clusters. Dot size is proportional to number of cells with detectable expression of indicated gene. Dot color intensity indicates gene expression values scaled across plotted clusters. *Gene ID ENSECAG00000006663 is labeled FCGR3A/B based on Ensembl/NCBI annotations. (F) Dot plot of select genes differentially expressed across DC clusters. Additional details as in (E). (G) Hierarchical clustering of equine PBMC scRNA-Seq data (monocyte/DC clusters) and human PBMC scRNA-Seq data (monocyte/DC clusters). Median-normalized average expression values for highly variable human/horse one-to-one orthologues were calculated for each cluster, and clustering was performed on Pierson distances by Ward s method.