TrackUSF
TrackUSF is designed for high-throughput automated analysis of auditory recordings in the ultrasonic range (20–100 kHz). As depicted in Fig. 1, each auditory clip is divided into 6-ms fragments which are filtered using a 15-kHz high-pass filter. First, all fragments that contain signals exceeding a predetermined power threshold (ultrasonic fragments, USFs herein) are collected. Notably, this threshold is the only parameter that needs to be predetermined by the user. Then, the power spectrum between 15 and 100 kHz of each USF is converted to 16 Mel-frequency cepstral coefficients (MFCCs). Mel-frequency features represent the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear Mel-scale of frequency, with the frequency bands equally spaced according to the Mel-scale [31, 32]. MFCCs of USFs pooled from all audio clips of the experiment are then analyzed together using a 3-dimensional (3D) T-distributed Stochastic Neighbor Embedding (t-SNE) for visualization of the multi-dimensional dataset. 3D t-SNE models each high-dimensional vector by a point in a three-dimensional space such that, with high probability, similar vectors are modeled by nearby points and dissimilar vectors by distant points. Following t-SNE analysis, distinct clusters are defined, either manually or using the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) automatic clustering algorithm (see graphical user interface in Additional file 1: Fig. S1). In addition to the t-SNE graphs, the software generates a Matlab file for each audio clip, which contains the time stamps and cluster affiliation of each USF. This enables the software to present the detected USFs on the spectrogram of the audio clip and to analyze the power spectrum density (PSD) of any given combination of clusters and the number of USFs for each cluster.
Another type of output is an Excel file detailing the number of USFs of each cluster for each audio clip analyzed.
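The fragment-selection step described above can be illustrated with a minimal Python sketch. It is not the TrackUSF implementation: it assumes the signal has already been high-pass filtered, and it assumes that "power" means the mean squared amplitude of a fragment, which the text does not specify. The function name `detect_usfs` and its parameters are hypothetical.

```python
def detect_usfs(samples, sample_rate, threshold, fragment_ms=6.0):
    """Return (start_time_s, mean_power) for every fragment above threshold.

    Assumptions (not taken from the paper): `samples` is already
    high-pass filtered, and power is the mean squared amplitude.
    """
    frag_len = int(round(sample_rate * fragment_ms / 1000.0))
    if frag_len <= 0:
        raise ValueError("fragment length must be at least one sample")

    usfs = []
    # Walk over consecutive, non-overlapping fragments; a trailing
    # partial fragment is ignored.
    for start in range(0, len(samples) - frag_len + 1, frag_len):
        fragment = samples[start:start + frag_len]
        power = sum(x * x for x in fragment) / frag_len
        if power > threshold:
            usfs.append((start / sample_rate, power))
    return usfs
```

Lowering `threshold` admits more fragments, which is consistent with the longer processing times reported for lower thresholds.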
Validation of the TrackUSF methodology with mouse mating calls
To validate TrackUSF, we first compared it to the manual USV-based methodology [6, 33] by analyzing mating calls of C57BL/6J and BalbC mice. It should be noted that hereafter we use the term “call” for any type of ultrasonic vocalization, regardless of its structure or sequence. The USV-based analysis took ~30 work-hours of a well-trained observer, while TrackUSF processed the same data in ~15 min on a standard computer. Figure 2A depicts the t-SNE analysis of all USFs detected by TrackUSF, with each USF represented by a single dot color-coded according to the mouse strain. This analysis revealed segregation between C57BL/6J (red) and BalbC (blue) USFs, suggesting distinct vocalization characteristics. We used the manual clustering option of TrackUSF to define four clusters of USFs (Fig. 2A, gray lines) based on separation in space and on the distinct segregation of the two strains in each cluster. This enabled us to inspect each USF with respect to its corresponding USV by overlaying detected USFs onto audio-clip spectrograms. As exemplified in Fig. 2B, groups of USFs, denoted by their cluster numbers, overlap distinct USVs. The first example (Fig. 2Bi) included only non-vocal sounds and was enriched with USFs from cluster 1 (Fig. 2A), suggesting that cluster 1 is mostly composed of non-vocal sounds (herein termed noise). Other examples include USVs represented by USFs originating from clusters 2–4 (Fig. 2Bii–iv).
To further analyze each of the clusters we defined, we plotted the PSD profile of USFs from each cluster. As apparent in Fig. 2C, all clusters, except for cluster 1, showed clear, distinct peaks at specific frequencies. In contrast, cluster 1 included USFs of variable frequencies, mainly in the lower range. Given this PSD profile, and our findings that these USFs represent noise, cluster 1 was excluded from all downstream analyses of this dataset.
To compare the results of TrackUSF with those obtained using the manual USV-based methodology, we plotted the distribution of the USFs for each cluster and the manually detected USVs over time. As depicted in Fig. 2D, manually detected USVs appeared in sequences (songs), with prolonged periods of silence between them. Notably, almost all USVs were represented by at least one USF (from clusters 2–4), with no apparent false-positive USFs.
To evaluate the effectiveness of TrackUSF as an analysis tool for ultrasonic vocalizations, we compared its analytical abilities to those of DeepSqueak, one of the most cited (cited by 99 articles, February 3, 2022, Google Scholar) recent computerized tools for such analysis [27]. We set the tonality level of DeepSqueak to 0.15 arbitrary units (a.u.) since the default level of 0.3 a.u. yielded poor results. Analyzing the same dataset described above (Fig. 2) using DeepSqueak took a similar amount of time as using TrackUSF. Figure 3A shows an example of a spectrogram analyzed manually (detected USVs within orange squares), by DeepSqueak (detected USVs within blue squares), and by TrackUSF using two threshold levels (detected USFs marked by asterisks color-coded according to threshold level). As apparent, in this case, DeepSqueak mistakenly defined three distinct USVs as a single one, a mistake that repeated itself multiple times (~10% of detected USVs, see summary in Additional file 2: Table S1). As apparent, TrackUSF detected all USVs in the example by at least one USF. However, detected USFs covered only part of each USV (mainly the areas of stronger signals), in a manner strongly dependent on the threshold level. For a quantitative comparison between the various methods, we employed TrackUSF to analyze the data using five distinct threshold levels (1, 1.5, 2.2, 2.7, and 3.5 a.u.). As expected, the lower the threshold, the longer the time it takes for TrackUSF to analyze the same set of data. In our case, the range was between ~20 min for threshold = 1 to ~10 min for threshold = 3.5. The percent of manually detected USVs which were overlapped by at least one USF varied between 84% in the lowest threshold (1 a.u.) and 46% in the highest threshold (3.5 a.u.), while DeepSqueak performed at 68% (Fig. 3B, Additional file 2: Table S1). 
The total duration of manually defined USVs that was also occupied by USFs varied between 48% in the lowest threshold and 19% in the highest, while DeepSqueak captured 67% of USV duration (Fig. 3C, Additional file 2: Table S1). This is most likely because many USVs are interrupted by gaps of silence or low amplitude signals, which are counted in their duration by USV-based methodologies but do not contain USFs (see example in Fig. 2Biii).
Nevertheless, we found a statistically significant correlation between the number of manually detected USVs and the number of USFs detected by TrackUSF for all threshold levels (Pearson correlation, R2=0.81, 0.88, 0.91, 0.92, 0.93, respectively, p<0.001 for all; Fig. 3D). We also found that less than 1% of all USFs were false positives (not representing any real USV), regardless of the threshold level (Fig. 3E). In contrast, DeepSqueak had a much higher level of false-positive detections, ranging between 0.3 and 26% for the various audio clips and averaging 6.9% (Fig. 3E, Additional file 2: Table S1).
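The three detection measures used in this comparison (USVs overlapped by at least one USF, USV duration occupied by USFs, and false-positive USFs) can be sketched as follows. This is an illustrative sketch, not the analysis code used here: it assumes USVs are given as non-overlapping (start, end) intervals in seconds, that each USF occupies 6 ms from its time stamp, and that USF fragments do not overlap one another. All function names are hypothetical.

```python
def _overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def usv_hit_rate(usvs, usf_times, fragment_s=0.006):
    """Fraction of USVs overlapped by at least one USF."""
    hits = sum(
        1 for s, e in usvs
        if any(_overlap(s, e, t, t + fragment_s) > 0 for t in usf_times)
    )
    return hits / len(usvs)


def duration_coverage(usvs, usf_times, fragment_s=0.006):
    """Fraction of total USV duration that is occupied by USFs."""
    total = sum(e - s for s, e in usvs)
    covered = sum(
        _overlap(s, e, t, t + fragment_s)
        for s, e in usvs for t in usf_times
    )
    return covered / total


def false_positive_fraction(usvs, usf_times, fragment_s=0.006):
    """Fraction of USFs that do not overlap any manually defined USV."""
    false_pos = sum(
        1 for t in usf_times
        if not any(_overlap(s, e, t, t + fragment_s) > 0 for s, e in usvs)
    )
    return false_pos / len(usf_times)
```

Because a USV counts as detected once a single fragment overlaps it, the hit rate can be high even when duration coverage is low, which is the pattern reported above.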
To assess the ability of TrackUSF to capture differences in call frequency profiles between animal groups, we compared the PSD analyses of the manually detected USVs and DeepSqueak to those obtained using TrackUSF, separately for the C57BL/6J (Fig. 3F) and BalbC (Fig. 3G) calls. Interestingly, this analysis identified a clear difference between the two strains, with the USVs of C57BL/6J mice showing the main peak at a lower pitch (40 kHz) compared to the higher pitch (60 kHz) of the BalbC USVs. To verify that similar characteristics are identified with TrackUSF (using a threshold of 2.7 a.u.), we scaled the PSD curve of each cluster to the number of USFs in this cluster. We then summed the curves of the scaled clusters separately for C57BL/6J and BalbC mice. This analysis yielded PSD curves that were highly similar to those achieved using the manually extracted USVs (Fig. 3F, G). Notably, a similar analysis of the USVs detected by DeepSqueak from the same recordings yielded very similar PSD profiles for C57BL/6J mice (Fig. 3F), but a profile shifted towards higher frequencies compared to manually detected USVs in the case of BalbC mice (Fig. 3G).
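The scaling-and-summing step can be written as a weighted sum of per-cluster PSD curves. The sketch below is only an illustration under assumptions: each cluster PSD is taken to be a list of values on a shared frequency axis, and no further normalization is applied after summing, since the text does not state whether one was used. The name `combined_psd` is hypothetical.

```python
def combined_psd(cluster_psds, cluster_counts):
    """Sum per-cluster PSD curves, each weighted by its USF count.

    cluster_psds:   {cluster_id: [psd value per frequency bin]}
    cluster_counts: {cluster_id: number of USFs in that cluster}
    """
    if not cluster_psds:
        raise ValueError("at least one cluster is required")

    n_bins = len(next(iter(cluster_psds.values())))
    total = [0.0] * n_bins
    for cluster_id, psd in cluster_psds.items():
        if len(psd) != n_bins:
            raise ValueError("all PSD curves must share the same bins")
        weight = cluster_counts[cluster_id]
        for i, value in enumerate(psd):
            total[i] += weight * value
    return total
```

Weighting by USF count lets large clusters dominate the combined curve, so the result approximates the spectrum of all detected fragments of one strain.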
Since DeepSqueak enables automatic analysis of two more parameters for the detected USVs, call length and slope, we compared the probability functions of these parameters between the two strains. We found that these parameters do not yield a better separation between the USVs of the two strains than call frequency (Fig. 3H, I). Finally, when employing the ability of DeepSqueak to cluster (using K-means clustering) the detected USVs based on all three parameters (principal frequency, call length, and slope; Fig. 3J), we found that the segregation between the two strains was apparently similar to that achieved using TrackUSF (compare to Fig. 2A), which is based upon signal frequency and amplitude only.
Thus, TrackUSF enables automated and time-efficient analysis of mating calls in mice in a manner that identifies the majority of USVs detected either manually or by DeepSqueak. Accordingly, the number of USFs identified using TrackUSF correlated very well with the number of USVs detected by the USV-based methodologies. Moreover, TrackUSF seems superior to DeepSqueak in its very low rate of false-positive detections and in avoiding the merging of multiple USVs into one. Finally, despite the relatively limited coverage of USV duration by TrackUSF, it accurately captures the spectral characteristics of the calls and allows separation between animal groups accordingly.
Validation of the TrackUSF methodology with rat social calls
Adult rats emit a relatively high rate of USVs during social (either male-male or male-female) interactions [6, 8, 33, 34]. These calls are generally divided into two categories. The first type is the “22 kHz aversive calls,” which are associated with negative states and aversive situations and are characterized by low pitch (20–30 kHz) and prolonged duration (150–3000 ms). The second type is the “50 kHz appetitive calls,” which are further divided into flat and highly modulated (trills) USVs and are associated with positive states and appetitive situations. These appetitive calls are characterized by high pitch (40–80 kHz) and short duration (10–150 ms). To assess the efficiency of TrackUSF in analyzing rat social vocalizations, we employed this system (using a threshold of 1) for analyzing USVs emitted during 5-min free social interactions between pairs of male and female Sprague Dawley (SD) rats (n=6 pairs, one 5-min long audio clip per pair). Similarly to mouse calls (Fig. 2A), TrackUSF analysis of rat calls produced two clear clouds of USFs in the t-SNE analysis: one of noise and one of vocalization fragments (Additional file 1: Fig. S2A). Here we used the automatic clustering option of the TrackUSF software (see the “Methods” section) to define various clusters, displayed in distinct colors in Additional file 1: Fig. S2B. Out of the 10 automatically defined clusters, clusters 1–6 comprised noise fragments, while clusters 7–10 comprised vocalization fragments (see examples in Additional file 1: Fig S2C and Fig. 4A). Accordingly, PSD analysis of the various clusters revealed that clusters 7–10 yielded each a well-defined peak in the range of 35–70 kHz (Additional file 1: Fig. S2D).
We then compared the TrackUSF analysis with that of another recently published computerized tool for segmentation of rodent USVs, USVSEG [29]. We applied this tool to the same dataset of rat calls, using the specific parameters defined by the authors for rat pleasant calls [29]. Figure 4A exemplifies the spectrogram of audio segments defined as USFs by TrackUSF (green and orange asterisks) and those defined as USVs by USVSEG (numbered purple frames above the spectrogram). As apparent, TrackUSF detected all USVs in this spectrogram without including any noise segment. In contrast, besides genuine USVs (frames 1, 3–5), USVSEG also detected several noise segments as USVs (frame 2). Moreover, as with DeepSqueak (Fig. 3A), USVSEG defined multiple USVs as one (frame 5). A quantitative comparison of both tools to manual analysis revealed that TrackUSF detected about 60% of the calls (56.5–78.2% relative to manually detected USVs) while USVSEG detected about 100% of them (in some cases even better than the manual analysis) (Fig. 4B). However, this high rate of detection comes at a price, as USVSEG also detected a very high rate (10–90% relative to manually detected USVs) of noise audio segments as USVs (Fig. 4C). Thus, USVSEG yielded a very high rate of false-positive detections. In contrast, TrackUSF made a very low rate of false-positive detections (<1%). As for mouse USVs, the number of USFs detected by TrackUSF for rat calls was linearly correlated with the number of USVs in each audio clip (Pearson correlation, R2=0.9618, p<0.001; Fig. 4D). Finally, PSD analysis of the audio segments detected as USVs by the manual analysis, TrackUSF, and USVSEG (Fig. 4E) revealed that TrackUSF yielded a PSD profile very similar to that of the manual analysis, while USVSEG yielded a rather distinct profile with a significant contribution of noise, as reflected by the prominent peak at low (20–30 kHz) frequency.
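The per-clip correlation between USF and USV counts reported above is a standard Pearson correlation. As a minimal sketch (standard library only, with a hypothetical function name), the coefficient can be computed as follows, and the reported R2 is its square.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")

    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)
```

In use, `x` would hold the number of manually detected USVs per audio clip and `y` the number of USFs detected in the same clips; `pearson_r(x, y) ** 2` then gives R2.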
For further validation of TrackUSF with rat calls, we used it for analyzing male-male calls in two types of settings: (1) during 5-min long free interactions between an adult (subject) and a juvenile (social stimulus) male rats; (2) during a 5-min social preference (SP) test when the juvenile was located within a triangular chamber at one corner of the arena and investigated by the adult male, as previously described by us [35, 36]. Each subject (n=15) was tested for 2–4 sessions in each type of setting. Overall, we recorded audio clips from 45 sessions of free interactions and a similar number of SP tests. Figure 4F depicts the t-SNE analysis of all USFs derived from vocalization recorded in these experiments, color-coded according to the type of session. We used again the automatic clustering option of the TrackUSF software to define various clusters, which are displayed in distinct colors in Fig. 4G. Example spectrograms of vocalization containing USFs of clusters 1, 5, or 9 (each spectrogram contains USFs of a single cluster) are shown in Fig. 4H, while examples representing all other clusters are displayed in Additional file 1: Fig. S2E. As apparent, clusters 1–4 are clearly separate from all other clusters and represent noise. Thus, as was demonstrated above for mouse and rat male-female calls, the noise was readily separated from the vocal signals by TrackUSF. In contrast to clusters 1–4, clusters 5–8 represent vocalizations with the characterization of aversive calls, i.e., prolonged flat vocalizations below 30 kHz, while cluster 9 seems to represent appetitive calls, which are brief and above 50 kHz. This is also apparent from the PSD analysis of the various clusters (Additional file 1: Fig. S2F), shown in Fig. 4I as the mean PSD profiles of clusters 1–4 (noise), clusters 5–8 (aversive calls), and cluster 9 (appetitive calls). 
Thus, the TrackUSF analysis revealed the same types of aversive and appetitive vocalizations that are well-known to characterize male-male social interactions in rats. Yet, the clear separation of the aversive calls into several clusters (5–8) suggests small but consistent differences between these calls in the frequency domain, as confirmed by their distinct PSD profiles (Additional file 1: Fig. S2F). The sensitivity of TrackUSF to such differences, especially at the lower range of sound frequency, may be an advantage of TrackUSF over previous techniques.
We further analyzed the numbers of USFs of the various clusters in each session and categorized them according to call type (aversive calls — clusters 5–8; appetitive calls — cluster 9). As shown in Fig. 4J, both types of experimental settings (free interaction and SP test) revealed more sessions enriched with appetitive USFs than sessions enriched with aversive USFs. Also, appetitive USFs are more abundant during free interactions where both animals are free to move in the arena, while aversive USFs are more frequent during SP tests where the juvenile rat is restricted in the triangular chamber. We found a significant difference in abundance between appetitive and aversive USFs in the free interaction setting but not for the SP test, and a significant difference between free interaction and SP test for appetitive USFs but not for aversive calls (Friedman test: χ2 (3)=28.594, p<0.001, post hoc Wilcoxon signed rank test: SP aversive-affiliative: Z=−0.994, p=0.320; free aversive-affiliative: Z=−4.438, p<0.001; SP-free affiliative: Z=−3.782, p<0.001; SP-free aversive: Z=−0.013, p=0.990). Thus, TrackUSF can be used to capture differences in social vocalization activity between distinct experimental settings.
Using TrackUSF for analysis of bat calls in a natural setting
To validate TrackUSF as a means for analysis of ultrasonic vocalizations of non-rodent animals recorded outside of the lab, we employed it to analyze recordings of bat echolocation calls made in the Hula Valley in Israel, using microphones located at three distinct heights (50 m, 100 m, and 150 m) above ground, hanging on the cord of a large helium balloon. From each height, we manually preselected recordings that contain calls of single bats from two different species: R. microphyllum and P. pipistrellus (typically 2–4 s long). Figure 5A depicts the t-SNE analysis of all USFs identified in these recordings, color-coded according to the recorded bat species and height of recording (7–49 clips for each case, see legend of Fig. 5A). As apparent, the USFs of different species segregate into distinct clusters. We used the automatic clustering option of the TrackUSF software and observed many small clusters representing noise (circled by a gray line in Fig. 5A, B), which was strong and highly variable between the recordings (Fig. 5C). In addition to the noise, three clusters of vocalization were revealed, of which cluster 1 represented recordings of P. pipistrellus, while clusters 2 and 3 mainly represented recordings of R. microphyllum. As shown in Fig. 5C, clusters 2 and 3 represent very similar vocalizations, while cluster 1 represents vocalizations with a distinct structure and frequency. These differences are also observed in the PSD analysis of the three clusters (Fig. 5D), which revealed very similar peaks for clusters 2 and 3 at 25–26 kHz, probably representing two different individuals (see insets in Fig. 5C), while cluster 1 yielded a prominent peak at ~47 kHz. We then counted the USFs detected for each cluster at each height and averaged these numbers separately for each bat species. As apparent in Fig. 5E, for P. pipistrellus, we found almost only USFs of cluster 1, with no difference in USF number between the various heights (Kruskal-Wallis test: χ2 (2) = 1.774, p = 0.412). In contrast, for R. microphyllum, we observed USFs of cluster 3 at all heights (Kruskal-Wallis test: χ2 (2) = 3.816, p = 0.148), while cluster 2 was observed only at 50 m and 100 m, but was absent at 150 m (Kruskal-Wallis test: χ2 (2) = 18.409, p < 0.001). Thus, using TrackUSF, we were able to detect the previously described species-dependent differences in vocalizations [37, 38] without a need to train the system for detecting such differences. Notably, even though the vocalizations composed of USFs from clusters 2 and 3 are almost identical, the use of TrackUSF allowed separating them, showing again its potency in distinguishing between rather similar auditory signals.
TrackUSF reveals modified social vocalizations of Shank3-deficient rats
Following the validation of the TrackUSF methodology, we examined the ability of this system to reveal modified vocalization activity during social interactions in Shank3-deficient rats, a rat model of ASD [39, 40]. To record such USVs, we conducted experiments comprising 10-min-long encounters between dyads of adult male rats of the same genotype, as described for SD rats above (Fig. 4). About half of the experiments comprised encounters between unfamiliar (novel) animals and the other half between familiar animals (cagemates). Besides the three genotypes of Shank3-deficient rats (wild-type (WT), heterozygous (Het), and homozygous (KO)), we conducted similar experiments with age-matched adult male SD rats. Overall, we recorded 109 experimental sessions. It should be noted that the relatively larger number of Het sessions (see legend of Fig. 6A) reflects their relative abundance in the litters.
All audio clips were pooled together and analyzed by TrackUSF using a threshold of 2.7 a.u. and clusters were defined manually. As apparent (Fig. 6A), some clusters of USFs (e.g., 15, 16) contained a significant representation of all types of experimental sessions (all genotypes and both familiarity levels). Nonetheless, other clusters (e.g., 4–14) contained almost solely USFs of Het or KO Shank3-deficient rats. These results suggested different USVs emitted during social encounters between Shank3-deficient rats and their WT littermates or SD rats. To further examine this possibility, we separately analyzed the USFs represented in each cluster by PSD analysis (Fig. 6B).
As with the former datasets described so far, cluster 1, which was clearly separated from all other clusters and included data from all genotypes, comprised USFs of variable frequencies at the lower range. By examining their appearance in the spectrograms, USFs of cluster 1 were found to be non-vocal sounds (noise, see example in Fig. 6Di), and therefore, this cluster was excluded from all further analyses. Clusters 2 and 3 were also excluded from this and other downstream analyses, as they included USFs originating from only two sessions (the number of sessions representing each cluster is detailed in Fig. 6B legend).
Clusters 15 and 16, which showed wide PSD peaks above 50 kHz (Fig. 6B), contained USFs of brief vocalizations that seem to represent the classical 50-kHz appetitive calls (Fig. 3E iv, v), described above for SD rats. In contrast, clusters 4–14, which contained mainly USFs from Het and KO rats, displayed relatively sharp, well-defined peaks between 20 and 40 kHz. By their appearance in the spectrograms (Fig. 3D and E ii–iii), USFs of these clusters seem to represent vocalizations that are intermediate between the classical rat appetitive and aversive calls. In fact, in many cases, USFs of these clusters (especially 4–10) appeared in sequences along prolonged flat USVs, which resembled classical aversive calls (Fig. 3D).
In order to examine if USFs from the distinct clusters (4–16) tend to appear in certain combinations, we used TrackUSF to calculate their likelihood to appear before or after a USF from a given cluster (hereafter termed “vicinity”), within a time window of 0.5 s for each direction (Fig. 6E, Additional file 1: Fig. S3, color-coded for each cluster). We found that USFs from clusters 4–14 had variable tendencies to appear in certain combinations (see for example Fig. 6E—left panels, for cluster 9), but the highest likelihood was for the repetitive appearance of USFs from the same cluster, as reflected by the high amplitude of their vicinity peak (middle peak in each representative graph in Fig. 6E and Additional file 1: Fig. S3). We termed this likelihood as “repeatability” and further explored it below. In contrast to clusters 4–14, USFs of clusters 15 and 16 showed almost no vicinity with USFs from other clusters in all genotypes (Fig. 6E — middle and right panels).
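The vicinity measure can be pictured as a histogram of time lags between USFs of a reference cluster and USFs of another (or the same) cluster, restricted to 0.5 s in each direction. The following is a minimal sketch under assumptions: the bin width is arbitrary, raw counts are returned rather than a normalized likelihood (the normalization is not stated in the text), and the function name is hypothetical.

```python
def vicinity_histogram(ref_times, other_times, window_s=0.5, bin_s=0.05,
                       same_cluster=False):
    """Histogram of lags (other - reference) within +/- window_s seconds.

    When same_cluster is True, ref_times and other_times must be the same
    list, and the zero lag of each USF with itself is skipped.
    """
    n_bins = int(round(2 * window_s / bin_s))
    counts = [0] * n_bins
    for i, t_ref in enumerate(ref_times):
        for j, t_other in enumerate(other_times):
            if same_cluster and i == j:
                continue  # do not pair a USF with itself
            lag = t_other - t_ref
            if -window_s <= lag < window_s:
                index = int((lag + window_s) / bin_s)
                counts[min(index, n_bins - 1)] += 1  # guard float rounding
    return counts
```

Calling the function with the same list for both arguments and `same_cluster=True` gives the curve whose central peak corresponds to what is termed repeatability here.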
Shank3-deficient rats emit higher numbers of low-pitch vocalization fragments
We next examined if the number of detected USFs varied between the various genotypes (Fig. 7A). We found that while all SD rats displayed < 200 USFs, the Shank3-Het, KO, and WT littermates presented a tri-modal distribution, with many of them displaying > 200 USFs. Nevertheless, only a few sessions of WT rats had > 400 USFs, while most KO sessions had > 600 USFs (Fig. 7A). To further explore this tendency, we categorized each session of the three genotypes of Shank3-deficient rats according to the number of detected USFs to low (<600) and high (>600) and examined the proportions of each genotype in these categories separately for cagemates and novel sessions. We found that while WT and Het animals showed a rather similar proportion of 14–27% sessions with > 600 USFs, in KO animals, more than 50% of the sessions were with > 600 USFs (Fig. 7B). This tendency was apparent in all sessions, regardless of the familiarity between the animals (novel animals or cagemates). Statistical analysis revealed a significant difference between the three genotypes (Kruskal-Wallis test: χ2 (5) = 14.874, p = 0.0109), with no familiarity-dependent differences. We therefore combined the two familiarity levels and analyzed the statistical differences between the three genotypes. We found a statistically significant difference between the three genotypes (Kruskal-Wallis test: χ2 (2) = 12.412, p = 0.002). A post hoc analysis revealed a significant difference between KO animals and the two other groups (p-adjusted chi-square, KO:Het — p = 0.003; KO:WT — p = 0.019), with no difference between WT and Het animals.
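The categorization of sessions into low and high USF counts amounts to computing, per genotype, the proportion of sessions above a cutoff. A minimal sketch follows; the input layout (a list of genotype and count pairs) and the function name are assumptions made for illustration, and the statistical tests reported above are not reproduced.

```python
from collections import defaultdict


def high_usf_proportion(sessions, cutoff=600):
    """Per-genotype fraction of sessions with more than `cutoff` USFs.

    sessions: iterable of (genotype, usf_count) pairs, one per session.
    """
    totals = defaultdict(int)
    high = defaultdict(int)
    for genotype, usf_count in sessions:
        totals[genotype] += 1
        if usf_count > cutoff:
            high[genotype] += 1
    return {g: high[g] / totals[g] for g in totals}
```

The same function could be applied per cluster with a lower cutoff for a finer categorization.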
We further examined this tendency separately for each of the clusters using a slightly more detailed categorization. As shown in Fig. 7C for clusters 4–16, quantities of USFs from clusters 4–14 were differentially distributed between the genotypes. While WT animals showed very restricted numbers of sessions with >50 USFs, KO animals displayed high numbers of such sessions, and Het animals were between WT and KO animals. In contrast, clusters 15 and 16, which seem to represent the 50-kHz appetitive USVs, were similarly distributed between the three distinct genotypes. A closer look into these results suggested a gradient in the number of sessions with >50 USFs between the various clusters, with generally higher numbers of sessions for clusters representing high-pitch calls (Fig. 7C). In agreement with this observation, we found a statistically significant positive correlation (Pearson correlation, R2=0.77, p < 0.0001) between the number of sessions contributing > 10 USFs for a given cluster and the PSD peak frequency of this cluster (Fig. 7E), with high-pitch clusters having more sessions than low-pitch clusters. These results suggest that high-pitch calls are more common among the various sessions.
We also noticed a similar gradient when calculating the probability of USFs to follow or precede other USFs of the same cluster (repeatability) (Fig. 7D). We therefore calculated the half-width of this repeatability curve for each cluster and used it as a proxy for the duration of USVs composed of repeated appearances of the same USF. This analysis was done for Het and KO animals together, as they showed very similar repeatability curves (Additional file 1: Fig. S4), while for WT animals we did not have enough calls to perform such analysis for all clusters. A statistically significant negative correlation (Pearson correlation, R2=0.78, p<0.0001) was found between the PSD peak frequency and repeatability half-width of each cluster (Fig. 7F). Given that wider repeatability half-width is pointing to a longer duration of USVs, this correlation suggests that longer USVs are composed of low-pitch USFs. Taken together, these results suggest that Shank3-deficient animals (Het and KO animals) exhibit a spectrum of modified social vocalizations. Within this spectrum, a stronger modification, exhibited by fewer animals, is reflected by calls that are closer to 22-kHz USVs in both their pitch (low) and duration (extended), while weaker and more common levels of modification are reflected by USVs that are closer to 50-kHz calls in both pitch and duration.
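For readers who wish to reproduce a width measure for the repeatability curve, the sketch below computes the full width at half maximum of a single-peaked curve using linear interpolation between samples. This is one plausible reading of "half-width"; the exact definition used in the analysis is not given in the text, so the choice is an assumption, as is the function name.

```python
def half_width(lags, values):
    """Full width at half maximum of a single-peaked, non-negative curve.

    lags:   increasing x positions (e.g. time lags in seconds)
    values: curve height at each lag
    """
    peak = max(range(len(values)), key=values.__getitem__)
    half = values[peak] / 2.0

    # Walk left from the peak to the first crossing of the half level.
    left = lags[0]
    for i in range(peak, 0, -1):
        if values[i - 1] < half <= values[i]:
            frac = (half - values[i - 1]) / (values[i] - values[i - 1])
            left = lags[i - 1] + frac * (lags[i] - lags[i - 1])
            break

    # Walk right from the peak to the first crossing of the half level.
    right = lags[-1]
    for i in range(peak, len(values) - 1):
        if values[i + 1] < half <= values[i]:
            frac = (values[i] - half) / (values[i] - values[i + 1])
            right = lags[i] + frac * (lags[i + 1] - lags[i])
            break

    return right - left
```

If the curve never falls below half of its maximum on one side, that side defaults to the edge of the sampled window, so the result is then a lower bound on the true width.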