A neuron ID dataset of head neurons
In this study, we focused on the head neurons of an adult animal of the soil nematode C. elegans, which constitute the major neuronal ensemble of this animal [3]. The expression patterns of cell-specific promoters were used as landmarks for cell identification (Fig. 1a). The fluorescent calcium indicator Yellow-Cameleon 2.60 was expressed in a cell-specific manner by using one of the cell-specific promoters and used as a fluorescent landmark. All the neuronal nuclei in these strains were visualized by the red fluorescent protein mCherry. Additionally, the animals were stained by a fluorescent dye, DiR, to label specified 12 sensory neurons following a standard method [24]. The worms were anesthetized by sodium azide and mounted on the agar pad. The volumetric images of the head region of the worm were obtained with a laser scanning confocal microscope. All the nuclei in the images were detected by our image analysis pipeline roiedit3D [10] and corrected manually. The nuclei were annotated based on the expression patterns of fluorescent landmarks.
Finally, we obtained volumetric images of 311 animals with 35 cell-specific promoters in total (Fig. 1b). On average, 203.7 ± 0.52 (mean ± standard error) nuclei were found and 44.2 ± 0.86 (mean ± standard error) nuclei were identified (Fig. 1c). The names and the positions of all identified cells in each animal were summarized in Figshare Dataset S1 [23]. The names of identified cells in each strain are also summarized in Additional file 1: Table S1. These positions and promoter expression information are hereafter called the neuron ID dataset.
Figure 1d shows the names of identified cells and the number of animals (“counts”) in which the cells were identified. In most animals, 12 dye-stained cells and 25 pharyngeal cells were identified, in addition to cells identified by using cell-specific promoters. Finally, we identified a total of 175 out of 196 cells ranging from I1 class neurons (anterior) to AVG neuron (posterior). Out of 182 cells anterior to the retrovesicular ganglion, 171 cells were identified. We did not identify 11 cells including URA class (4 cells), RIS cell, SIA class (2 of 4 cells), AVK class (2 cells), and RMG class (2 cells) because of the lack of suitable cell-specific promoters. We also identified 4 cells in the retrovesicular ganglion including AVG cell, RIF class (1 of 2 cells), and RIG class (2 cells). This result indicates the neuron ID dataset covers a majority of neurons in the head region.
Note that we used H20 promoter as a pan-neuronal promoter [25]. We confirmed H20 promoter was expressed in the GLR glial cells and XXX atypical hypodermal cells by co-expressing the cell-specific promoters nep-2sp and sdf-9p, respectively. We estimated that H20 promoter was also expressed in pharyngeal gland cells and HMC cell, based on their positions. Also, we estimated that H20 promoter was expressed weakly in the hypodermal cells, based on their positions and shape of the nuclei, but we removed HMC and hypodermal cells from our neuron ID dataset. We confirmed H20 promoter was not expressed in the socket cells nor the sheath cells by co-expressing the cell-specific promoter ptr-10p.
Large variation disrupts position-based cell annotation
How large is the variation of the relative positions of the cells between individual animals? To answer this question, we need to first assess the potential sources of the variation. Intuitively, there are several possibilities: (1) placement (translational and rotational) of the worms in the obtained images, (2) curved posture of the worms (body bending), and (3) inherent variation of the cell position. In order to focus on the inherent variation that we are interested in, we considered a few ways to remove the contribution of (1) and (2). Principal component analysis (PCA) and subsequent alignment processes corrected the translation and rotation (see the “Methods” section). The quadratic curve fitting was employed to correct curved posture (Additional file 2: Figure S1). These methods significantly reduced the variation of cell positions (Additional file 3: Figure S2, Additional file 4: Table S2). Note that, after posture correction, the cell distribution is not deformed in the DV-LR plane (Additional file 2: Figure S1C, right panel), indicating that the distortion of worms is negligibly small under our experimental setups. To assess the contribution of temporal variation of cell positions in each animal, we utilized the time-lapse images obtained for whole-brain imaging. A large part of the movement of cells during time-lapse imaging can be removed by translation correction, and the remaining movements are too small to explain the observed variation of cell positions in the neuron ID dataset (Additional file 5: Figure S3, Additional file 6: Table S3).
After removing the contribution of (1) and (2), we compiled the positions of the named cells in the neuron ID dataset. The positions of the nuclei identified as the same cell were collected from the neuron ID dataset (Additional file 7: Figure S4). The mean and the covariance of the positions specify a tri-variate Gaussian distribution, which can be considered as the maximum likelihood estimate of cell position distribution. The three-dimensional ellipsoidal region of 2 standard deviation of the tri-variate Gaussian distribution is shown for each cell, in which about 70% of data points are expected to be included (Fig. 2a). The ellipsoids largely overlap with each other, especially in the lateral ganglia (mid region of the head), because of high variation and high density of the cells. The median distance between distribution centers of neighboring cells was 3.19 ± 0.15 μm (median ± standard error) (Fig. 2b). The median length of the shortest axis of the ellipsoids, equivalent to the twice of the smallest standard deviation, was about 2.69 ± 0.09 μm (median ± standard error). The values of the minimum distance and the shortest axis length were almost the same, indicating that the variation of the position of a cell reaches the mean position of the neighboring cells. Thus, the variations of the cell positions between individual animals are large.
The variations of the cell positions were explored further in a different way. We focused on the variations of the relative positions of neuron pairs. If we fix the position of a cell and align all other cells, the specific-cell-centered landscape can be drawn (Additional file 8: Figure S5). When ASKR cell was centered, the variations of positions of adjacent cells including SMDVR and ADLR were decreased, but that of other cells did not change or rather increased. When MI cell, an anterior pharyngeal neuron, was centered, the variations of positions of pharyngeal cells decreased, but those of other cells generally increased. These results suggest that the variations of relative positions are different depending on neuron pairs. We further obtained the variations of relative positions of all available cell pairs (Fig. 2c). The volume of the ellipsoid of relative positions is regarded as the variation of relative positions. We found clusters of less varying cell pairs (Fig. 2c, red boxes). The clusters include lateral ganglion cell pairs and pharyngeal cell pairs (Additional file 9: Figure S6). On the other hand, there are highly varying cells including RIC, AIZ, and FLP classes.
Where do these variations of cell positions come from? In order to tackle this problem, we performed an additional analysis. The pharynx of worms moves during development, and the position of the pharynx in the anterior-posterior axis differs between individual animals. We found that the positions of dye-positive cells were affected by the positions of the pharynx (Additional file 10: Figure S7, Additional file 11: Table S4); when the pharynx moved anteriorly, all the dye-positive cells (ASK, ADL, ASI, AWB, ASH, and ASJ classes) moved outside in lateral positions. In addition, anterior cells (ASK, ADL, and AWB classes) moved anteriorly, and posterior cells (ASJ class) moved posteriorly. In other words, the pharynx of worms pushed these neurons aside. This result indicates that some part of the inherent variations of cell positions come from the variations in organ placement.
How does the variations of cell positions disrupt the position-based cell annotation? Based on the mixture of the Gaussian distributions (Fig. 2a), the posterior probability of assignments was calculated for the respective cells in the respective animals. The name of the cell was estimated as the name of the Gaussian distribution which has the largest probability at the position of the cell in the target animal. In other words, the cell has the probabilities in which the cell belongs to the Gaussian distributions, and the most likely distribution was assigned for the cell. The error rates of this estimation method were visualized with cell positions (Fig. 2d). The error rates for the cells anterior to the nerve ring were relatively low (44.5% ± 24.0%, mean ± standard deviation), and those for the cells in the ventral ganglion were relatively high (64.8% ± 25.6%, mean ± standard deviation). Mean error rate was 49.7% ± 24.3% (mean ± standard deviation) (see Fig. 4c, described below), indicating that the variations of the cell positions actually disrupt the position-based cell annotation severely.
Optimal combination of the cell-specific promoters increases the number of identified cells in an animal
In order to reduce the error rate of the annotation method, one may want to use the information of fluorescent landmarks [8, 12]. Using multiple landmarks will reduce the error rate. One or two fluorescent channels are often available for the landmarks in addition to the channels required for the whole-brain activity imaging. We therefore sought for the optimal combination of cell-specific promoters for two-channel landmark observation using the neuron ID dataset.
Several properties of the promoters were evaluated in order to choose the optimal combination: the number of cells that are labelled (Fig. 3a, b, Additional file 12: Figure S8), stability of expression (Fig. 3a), sparseness of the expression pattern (Fig. 3b, see the “Methods” section for definition), and overlap of expression patterns in the case of combinations (see Figshare Dataset S1 [23]). Among the 35 tested promoters, eat-4p was selected because it was expressed in the most numerous cells in the head region (Fig. 3a). The promoters dyf-11p and glr-1p were also expressed in numerous cells, and glr-1p was selected as the second promoter because the sparseness of the expression patterns of glr-1p was higher than that of dyf-11p (Fig. 3b) and because the expression patterns of dyf-11p highly overlapped with that of eat-4p. Additionally, ser-2p2 was selected based on the stability of the expression and low overlaps with eat-4p and glr-1p. Thus, the combination of eat-4p, glr-1p, and ser-2p2 was selected (Fig. 3c). The latter two promoters were used with the same fluorescent protein assuming only two fluorescent channels can be used for the landmarks as is the case for our experimental setup for whole-brain imaging. In the neuron ID dataset, eat-4p was expressed in 69 cells and glr-1p + ser-2p2 were expressed in 50 cells out of 196 cells in the head region of adult worms.
All combinations of the promoters could be evaluated by an algorithm that considers the number of expression, sparseness, and overlap of expression patterns (see the “Methods” section and Additional file 13: Table S5). In brief, the algorithm highly evaluated a combination when two neighboring cells were in different colors. In the case of three promoters and two fluorescent channels, the combination consisting of eat-4p, glr-1p, and ser-2p2 was placed in the 18th rank out of the possible 20,825 combinations.
Here, we produced a strain JN3039 as follows. The far-red fluorescent protein tagRFP675 was expressed using eat-4p, and the blue fluorescent protein tagBFP was expressed using glr-1p and ser-2p2. The red fluorescent protein mCherry was expressed using the pan-neuronal promoter H20p. This strain does not use fluorescent channels of CFP, GFP, and YFP and is useful for the cell identification tasks. For example, if there is a strain that expresses one of these fluorescent proteins with a promoter whose expression patterns need to be identified, one can do it by just crossing the strain with our standard strain JN3039.
With the help of the optimized expression of landmark fluorescent proteins in JN3039, the number of identified cells in an animal is expected to increase compared to the strains used to make the neuron ID dataset. Actually, by using the JN3039 strain, we could manually identify 162.5 ± 4.23 (mean ± standard error) nuclei out of 202 ± 1.46 (mean ± standard error) detected nuclei from 15 adult animals (Fig. 3e–g, Additional file 14: Figure S9, Additional file 15: Dataset S2). Ignoring hypodermal cells reduced the number of the identified nuclei to 156.3 ± 3.46 (mean ± standard error). The number of identified cells in the latter case is 3.6 times higher than the average of the neuron ID dataset. We identified a total of 186 out of 196 cells that covers most neurons in the head region (Fig. 3g). These results indicate that our optimized strain is a powerful tool for neuronal annotation, for example for identifying expression patterns of genes of interest.
Additionally, a strain JN3038 was made from the strain JN3039 by expressing fluorescent calcium indicator Yellow-Cameleon 2.60 with the pan-neuronal promoter H20p. This five-colored strain inherits the neuron identification ability of JN3039 and will enable whole-brain activity imaging with annotation. We could manually identify 130.3 ± 4.63 (mean ± standard error) nuclei out of 188 ± 4.08 (mean ± standard error) detected nuclei on average from 12 adult animals (Fig. 3f, g). In total, we identified 171 out of 196 cells. The numbers of identified and detected nuclei were slightly lower than JN3039, likely because of the difference of experimental conditions including resolution of the optics and photobleaching.
We tested the health of JN3038 animals because worms often get sick by introducing multiple transgenes. The JN3038 animals displayed reduced brood size and moved slowly (Additional file 16: Figure S10A-D, Additional file 17: Table S6). Further, we tested whether the animals show normal salt chemotaxis. On a plate with NaCl gradient, worms are attracted to the NaCl concentrations which they have experienced with food [26]. Additionally, they avoid the NaCl concentrations which they have experienced without food [27]. Thus, the salt chemotaxis behavior is a kind of associative learning. Studying the neural mechanism of the behavior will also elucidate that of the learning. The JN3038 animals showed salt chemotaxis and learning ability similar to wildtype animals (Additional file 16: Figure S10E). This result suggests the neural circuit of the animals is healthy enough so that we will be able to dissect neural mechanisms of the learning and the behaviors by using the JN3038 strain.
Annotation algorithm in the computer-assisted semi-automatic annotation framework
Although most neurons in the head region of an animal could be annotated successfully by using our optimized strain, the manual annotation task often takes a long time and requires high expertise. Automatic methods for neural identity annotation will reduce the difficulty. However, such methods will suffer from low accuracy because the positions of the cells show large variations that disrupt the performance of a position-based cell annotation (Fig. 2d). Therefore, we developed an automatic annotation method as a part of a computer-assisted semi-automatic annotation framework. Note that our aim is not to establish a fully automatic annotation method, which will be extremely difficult, but to provide a computer-assisted semi-automatic annotation framework in which the accuracy can be improved through human-machine interaction. An example usage is an ab initio automatic estimation followed by manual correction. The computer estimation can also be used as a help for a decision during manual annotation.
The proposed annotation method consists of three parts: the generation of a large set of atlases (Fig. 4a), bipartite matching, and majority voting (Fig. 4b). The generated atlases capture high-order information of positional variations in the neuron ID dataset. During the bipartite matching step, the positional information is integrated with the additional information including expressions of landmark promoters and human correction. Majority voting reduces the effect of positional variations by exploiting the large set of generated atlases and returns multiple candidates that will reduce the difficulty of the manual annotation.
In order to generate an atlas with fully annotated cells, we combine positional information of cells from multiple partially annotated animals while maintaining the relative position between the cells as much as possible. We assembled the partially identified cell positions in our neuron ID dataset as follows (Fig. 4a, see the “Methods” section for detail):
- A)
Choose an animal 1 and extract the positions of the identified cells.
- B)
Choose an animal 2 and register it to animal 1 based on the positions of commonly identified cells in both animals.
- C)
Map the cells in animal 2 that were not identified in the animal 1 according to the registration, and add them to the identified cell list of animal 1.
- D)
Repeat steps B and C until all the animals were covered.
- E)
Add the positions of cells that were not identified in our neuron ID dataset using the positions of the cells in White’s atlas.
The resulting atlas depends on the order of assembly, which reflects the variation of the cell positions between individual animals. By random sampling of the animals, we generated around 3000 synthetic atlases that were used to reduce the error rate of the estimation.
In the bipartite matching step, the cells in a target animal were assigned to those in an atlas. An assignment of a cell in the target animal to the one in the atlas has a cost based on the similarity (or dissimilarity) of the two cells. The similarity scores are based on several factors including but not limited to Euclidean distance, expressions of landmark promoters, and the feedback from human correction as needed. The optimal combination of the assignments that minimize the sum of the costs was obtained by using the Hungarian algorithm. The name of the cell in the target animal can be estimated as the name of the assigned cell in the atlas in this step.
In the majority voting step, the bipartite matching of the target animal was repeated with different atlases. Each assignment of a cell is considered as a vote, and the most voted assignment was considered as the top rank estimation of annotation. Assuming the generated atlases could capture the positional variations of the cells, the vote of erroneous estimations will be dispersed and the vote of correct estimation will be a top rank.
In order to validate our atlas generation, we compared the position of cells in the generated atlases and the dataset. We visualized the mean and the covariance of the positions of the cells in the atlases (Additional file 18: Figure S11A). The outline of the cell positions in the atlas was similar to that in the dataset, suggesting that our atlases capture the positional variations of the cells in the dataset. The distances of mean positions of the cells between the atlases and the dataset were small (Additional file 18: Figure S11B, 1.43 μm in median), but became large for the rarely detected cells. The covariance of the positions of the cell between the atlas and the dataset was also similar (Additional file 18: Figure S11C), but the covariance for the atlases was larger than that for the dataset when the cell was rarely detected in the dataset. The variation of relative positions of the cell pairs in the atlases was similar to that in the dataset when the cell pair was co-detected in the dataset frequently enough, and became larger when the cell pair was less co-detected (Additional file 18: Figure S11D and Additional file 19: Figure S12). These results indicate that our atlases capture the positional information of the cells and its variations with high-order information (i.e., the variation of relative positions of the cell pairs). These results also suggest the atlases will be more precise when the neuron ID dataset includes a lot more animals and annotated cells.
To validate our automatic annotation method, a five-fold cross-validation test was performed. All the animals in the neuron ID dataset were randomly divided into five subsets. We performed a total of five tests. For each test, we excluded one of the subsets and used the remaining subsets for generation of atlases and used it to estimate the annotation performance of the reserved set based on the generated atlases. The error rate of bipartite matching was relatively high, and the majority voting could deliver significant improvements of the annotation accuracy (Additional file 20: Figure S13). On average, 78.8% ± 6.71% (mean ± standard error) nuclei were annotated and 46.2 ± 2.73 (mean ± standard error) nuclei were successfully estimated as the top rank, and the error rate of the top rank estimation was 40.1% ± 4.69% (mean ± standard error) (Fig. 4b, c). As a control, two methods were introduced; one method only considered the mean and covariance of the cell positions of raw data (without using the atlases and voting, see Fig. 2d). The other method considered the mean and covariance of the cell positions in the atlases (without using majority voting). The error rates of the two methods were higher than those of the proposed method, indicating that the majority voting step in the proposed method contributes to the correct estimation (Fig. 4c). If we considered the accuracy for the top 5 voted estimations (shown as rank 5), the error rate decreased to 8.09% ± 2.24% (mean ± standard error).
The automatic annotation method was applied to the animals with fluorescent landmarks (strain JN3039, see Fig. 3c–e). The error rates of the top rank estimation with and without fluorescent landmarks were 38.5% ± 2.64% (mean ± standard error) and 52.7% ± 1.84% (mean ± standard error), respectively, indicating that utilizing the fluorescent landmarks also contributes to the correct estimation (Fig. 4d). If we considered the accuracy for the top 5 voted estimations, the error rate decreased to 8.40% ± 1.95% (mean ± standard error). These error rates were comparable to the cross-validation results for the neuron ID dataset, suggesting that our annotation framework will work correctly for the whole-brain activity imaging.
In our neuron ID dataset, dye-positive 12 neurons and 25 pharyngeal cells were overrepresented because of our neuron identification strategy. We confirmed that this imbalance had little impact on the error rate of the automatic annotation for cells identified in JN3039 (Additional file 21: Figure S14), indicating the robustness of our automatic annotation method.
The automatic annotation method was also applied to the animals in a microfluidic chip for whole-brain activity imaging (Additional file 22: Figure S15). The error rates of the top rank estimation with and without fluorescent landmark were 52.1% ± 2.10% (mean ± standard error) and 72.8% ± 2.16% (mean ± standard error), respectively, and that of the top 5 voted estimations was 12.2% ± 1.32% (mean ± standard error). The worms were compressed and distorted to be held in the microfluidic chips, and the distortion of the worm may increase the error rates. During whole-brain imaging of free-moving animals [12,13,14]), the worms will be less compressed and less distorted, and our algorithm may work better.
Additionally, our algorithm was implemented in the GUI roiedit3d [10], and it can handle feedback information from the human annotations (Fig. 4e). One can choose the correct annotation from several estimated candidates. Once annotations are corrected manually, our automatic annotation method can accept corrections and can use them to improve the annotation of other neurons as well, because assignments inconsistent with manual annotation are not allowed in the bipartite graph matching step in our algorithm. Thus, our semi-automatic framework reduces the difficulty of annotation tasks. For another example, one can identify neurons manually by using other information including the neural activity or morphology; then, the automatic estimation for the other neurons will be improved. The final results can be added to the neuron ID dataset, and the annotation algorithm will work more accurately. Thus, the feedback system incorporates tacit knowledge into the automatic annotation method. A detailed step-by-step tutorial for semi-automatic annotation using RoiEdit3D is provided as Additional file 23: Dataset S3.
Finally, we tested the effect of the manual correction to the error rate of the automatic annotation method. We randomly selected a wrong annotation of a cell in rank 1 estimation for JN3039 (see Fig. 4d), corrected the annotation, and re-ran the automatic annotation method using the manual correction information. This step was repeated sequentially. The error rate decreased to zero when the number of the manual correction increased (Fig. 4f). Manual correction of one cell label led to 1.078 ± 0.0369 (mean ± standard error) increase in the correct cell labelling on average. This result indicates that our computer-assisted semi-automatic annotation framework corrected an additional 7.8% estimation by using manual correction information. Similarly, we tested the effect of prior manual annotation on the correct rate of automatic annotation. Annotation of a single cell type was specified manually before performing the automatic annotation for cells identified in JN3039 (Additional file 24: Figure S16 and Additional file 25: Figure S17). Prior manual annotation of SMDVL gave the largest improvement of the correct rate (0.59%, about 2 cells). Thus, through the interactive process, our algorithm will make human annotation tasks more efficient.