Biologically-informed excitatory and inhibitory ratio for robust spiking neural network training



Network accuracy versus initial firing rate

Non-noisy networks trained on Fashion-MNIST across the range of initial conditions, i.e. weight distributions and random seeds, were able to reach over 80% accuracy, similar to networks of the same size with unbounded weights14 (Fig. 1c). Accuracy is reported as the peak accuracy measured across the 30 epochs of training, and we observed no overfitting in the trials we measured. For networks with E:I ratios of 80:20, 95:5, and 100:0, the best-performing networks were initialized with weight distributions generating initial firing rates within the biologically realistic bounds of 0.01 Hz to 25.6 Hz27, leading to an energy-efficient implementation28 while still retaining high accuracy. Networks with a high percentage of inhibitory neurons, e.g. 50:50, were only able to train at the lowest end of the tested activity range, and their accuracy was not robust across the 8 repeat trials with different random seeds for the same distributions. In contrast, 100:0 networks exhibited high accuracy (over 80%) at initial firing rates over 50 Hz. However, these upper activity levels would not be energy efficient due to unnecessarily high spiking activity, nor do they confer additional accuracy benefits. Therefore, we focused on the lower end of the activity spectrum for further analysis.
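The dependence of the initial hidden-layer firing rate on the weight initialization scale can be sketched with a minimal leaky integrate-and-fire (LIF) simulation. Everything here (input rate, layer sizes, time constants, threshold) is an illustrative assumption, not the paper's actual settings:

```python
import numpy as np

rng = np.random.default_rng(1)

def hidden_rate_for_sigma(sigma_w, rate_in_hz=10.0, n_in=784, n_hidden=100,
                          dt=1e-3, t_sim=1.0, tau=20e-3, v_th=1.0):
    """Estimate the mean hidden-layer firing rate (Hz) of a simple LIF layer
    driven by Poisson inputs, for a given initial weight std sigma_w.
    All parameters are illustrative, not the paper's exact values."""
    n_steps = int(t_sim / dt)
    w = rng.normal(0.0, sigma_w, (n_in, n_hidden))
    v = np.zeros(n_hidden)
    decay = np.exp(-dt / tau)
    spikes = 0
    for _ in range(n_steps):
        s_in = rng.random(n_in) < rate_in_hz * dt   # Poisson input spikes
        v = decay * v + s_in @ w                    # leaky integration
        fired = v >= v_th
        spikes += fired.sum()
        v[fired] = 0.0                              # reset after spike
    return spikes / (n_hidden * t_sim)

# Sweep weight scales to find those whose initial rate lands inside the
# biologically plausible 0.01-25.6 Hz band cited in the text.
for sigma in [0.01, 0.05, 0.1, 0.2]:
    r = hidden_rate_for_sigma(sigma)
    print(f"sigma_w={sigma:.2f}: ~{r:.2f} Hz",
          "(in band)" if 0.01 <= r <= 25.6 else "")
```

Larger weight scales produce larger membrane-potential fluctuations and hence higher initial rates, which is the knob the initialization sweep in Fig. 1 turns.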

The SHD accuracy results show a similar trend between activity levels and accuracy (Fig. 1d): lower activity levels correspond to higher accuracy across all E:I ratios. Accuracies reached 45%, comparable to the accuracy of the unbounded hidden layer for this network architecture7. Interestingly, the purely excitatory network trained on SHD shows poor accuracy, probably because of the intrinsically high level of input noise in this dataset. Networks with inhibitory neurons train to higher accuracy, indicating the importance of proper ratios of excitatory:inhibitory activity in noisy environments. Compared to Fashion-MNIST, SHD has significantly less immediate separation between classes based on the input spike trains (which are noisier and higher in activity). For example, spike trains differ significantly between the pullover and sandal classes of the Fashion-MNIST dataset, where pullover cases have activity in pixels that are never activated by sandal cases. SHD lacks this level of separation between classes, which, combined with the noise in the input data, leaves minimal to no immediately decipherable difference between classes. Taken together, networks with a biologically realistic E:I ratio of 80:20 maximized accuracy on both datasets when initialized with low levels of activity.

E:I ratio versus robustness to noisy weight updates

To test the potential for mapping such excitatory:inhibitory networks to noisy hardware, noisy weight updates were implemented by adding normally distributed noise of varying standard deviation at every weight update. Each update combined the calculated gradient with random noise (defined by \(\sigma _{noise}\)) expressed as a percentage of the initial weight distribution (defined by \(\sigma _{init}\)), and the result was then bounded to the excitatory or inhibitory values (Fig. 2a). Emerging device technologies, such as oxide-based resistive switches, experience high levels of stochasticity in weight updates, particularly in the high-resistance state; we therefore modeled a broad range of \(\sigma _{noise}\) from 0% to 100% of \(\sigma _{init}\)29. Experimental results18 support this range, which corresponds to the measured stochasticity of oxide-based resistive switches of 1% to 100% of the initial resistive states (\(10^3 \Omega\) to \(10^6 \Omega\)). Training was conducted over a range of E:I ratios initialized with low activity and trained on the Fashion-MNIST dataset. As the noise level increased, accuracy decreased linearly for the higher E:I ratios. Comparing accuracy across ratios, the lower ratios (e.g. 70:30) outperformed the 100:0 network until higher noise levels, where their accuracy dropped to near random chance (10%) (Fig. 2b). For example, the 50:50 network's performance collapsed at \(\sigma _{noise} = 0.3 \sigma _{init}\), while the other networks still performed at 60% accuracy.
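The noisy, sign-bounded update described above can be sketched as follows. The dimensions, learning rate, and the 80:20 column layout are illustrative assumptions, with \(\sigma_{noise}=0.2\sigma_{init}\) as in the Fig. 2a example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions and initial weight scale (sigma_init); the paper's
# exact values are not given here, so these are illustrative.
n_in, n_hidden = 784, 100
sigma_init = 0.05
sigma_noise = 0.2 * sigma_init   # noise level from the Fig. 2a example

# Sign mask implementing an 80:20 E:I split: excitatory columns stay >= 0,
# inhibitory columns stay <= 0 (Dale's-law-style constraint).
n_exc = int(0.8 * n_hidden)
sign = np.ones(n_hidden)
sign[n_exc:] = -1.0

w = np.abs(rng.normal(0.0, sigma_init, (n_in, n_hidden))) * sign

def noisy_update(w, grad, lr=1e-3):
    """One weight update: gradient step plus Gaussian write noise,
    then re-bounded to the excitatory/inhibitory sign constraints."""
    noise = rng.normal(0.0, sigma_noise, w.shape)
    w_new = w - lr * grad + noise
    # Bound: excitatory weights clipped at zero from below, inhibitory from above.
    return np.where(sign > 0, np.maximum(w_new, 0.0), np.minimum(w_new, 0.0))

grad = rng.normal(0.0, 1.0, w.shape)   # placeholder gradient
w = noisy_update(w, grad)
# Sign constraints survive the noisy update:
print(bool(np.all(w[:, :n_exc] >= 0) and np.all(w[:, n_exc:] <= 0)))  # True
```

Clipping at zero (rather than re-randomizing) is one plausible reading of "bounded to the excitatory or inhibitory values"; the key property is that no weight crosses sign after an update.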

To understand the appropriate level of inhibitory activity for training in a noisy weight environment, such as neuromorphic hardware, the purely excitatory 100:0 network provides the baseline for this analysis. Unlike an unconstrained network, the 100:0 network and the remaining E:I ratios maintain a separation between excitatory (glutamatergic) and inhibitory (GABAergic) activity, which allows further comparisons to biology30. This baseline is also relevant for future hardware mapping in memristive neural networks: an individual memristive device can physically carry only a positive conductance and has an inherent level of noise, so the simplest and most direct translation to hardware is a purely excitatory (positive-weight) network. This is in line with prior work on nonnegative weight constraints31. We assessed average accuracies across E:I ratios at three different noise levels (Fig. 2c–e). At the lower noise range, \(\sigma _{noise} =0.2 \sigma _{init}\), networks performed near 70% accuracy (Fig. 2c). Comparing the 16 repeat trials with a t-test at each E:I ratio, the 75:25 network performed at higher accuracy than the 100:0 network at \(\sigma _{noise} =0.2 \sigma _{init}\) (\(p < 0.01\)). As the noise level increased, the 80:20 and 90:10 networks were the highest performing (Fig. 2d–e). While the previous noiseless Fashion-MNIST results showed that the 100:0 network was robust over a larger range of initial activity, these results indicate that biologically realistic ratios of excitatory and inhibitory activity provide optimal performance in the presence of noise.

Fig. 2

Accuracy of networks with noisy weight updates. (a) Weight updates, where the pre-update distribution determined the gradient for the update. The gradient and noise distributions were then combined with the weight distribution (this example used \(\sigma _{noise}=0.2\sigma _{init}\)) and bounded to obtain the post-update weight distribution. (b) Average accuracy of the final 10 of 30 epochs of training, with 16 repeat trials for each E:I ratio and noise level combination. The average across repeat trials is shown with corresponding lines; each trial is shown as an individual point. (c–e) Average accuracy across trials compared to the 100:0 E:I ratio, with positive accuracy differences indicating an improvement over the 100:0 networks, at three different \(\sigma _{noise}\) levels: (c) \(0.2\sigma _{init}\), (d) \(0.4\sigma _{init}\), and (e) \(0.6\sigma _{init}\). T-tests were performed between each E:I ratio and the baseline 100:0 network, with * indicating p < 0.05, ** p < 0.01, and *** p < 0.001.

Successful versus unsuccessful training

To further understand network training, the excitatory versus inhibitory spiking activity can be tracked by epoch for all classes to reveal differences between successful and unsuccessful networks. Activity levels before, during, and after training were visualized for the Fashion-MNIST dataset (Fig. 3 and animated in Supplemental Fig. S4). Initial activity spans from less than one hidden-layer spike per image on average to over 1000 spikes per image, and falls along the ratio of the initial networks, in this case 80% excitatory. After a single epoch of training, the separation between class accuracies in successful networks (overall accuracy over 50%) was visible across the activity levels (Fig. 3a, transition from column 2 to column 3). In this transition, class accuracy divided sharply between high-accuracy classes and multiple classes remaining near baseline (observed as points being either green or grey, with few points in between). This explains the clustering of accuracies in Fig. 1c: a few classes repeatedly train very well, rather than many classes all training to a moderate level. Additionally, the networks that increased in accuracy have a wider range of percent excitatory activity.

By comparison, networks that failed to train remained close to their original ratios of excitatory and inhibitory activity (Fig. 3b). While some individual classes of a network increased in accuracy (with an increased percentage of excitatory activity), other classes sustained low accuracy (with a decreased percentage of excitatory activity). Networks with high initial activity were unable to train regardless of the excitatory activity percentage. While overall network accuracy remained low, a few classes did train. These classes follow the same trend of reduced activity and an increased percentage of excitatory activity, similar to the first epoch of successful networks, but occurring later in the training sessions. While this may suggest that high-activity networks simply require longer training to reach high accuracy, this is unlikely given that nearly all classes in the unsuccessful trainings show an initial decrease in percent excitatory activity, which runs against the typical trend for class accuracy improvements. Overall, this provides further insight into why some networks successfully train on Fashion-MNIST while others do not. In addition, the best training occurs when the percent of excitatory activity increases from baseline, as seen both in the successful networks and in the few classes that trained in the failed networks.

Fig. 3

Fashion-MNIST class accuracy relative to the amount of activity and the percentage of activity that is excitatory across training. Separation is observed between (a) high-accuracy networks (>50% maximum overall accuracy) at lower activity levels and (b) low-accuracy networks at high activity levels. (First column) Accuracy convergence curves for all trials, colored by E:I ratio. (Second to fifth columns) The initial network, after a single epoch of training, after ten epochs of training, and after training is complete (thirty epochs). Each network is represented by 10 points (one for each class of the dataset); all networks tested in Fig. 1 with an E:I ratio of 80:20 are shown. Green arrows denote the trend of successfully trained classes, which exhibit an initial increase in the percent excitatory activity, followed by a decline for the remainder of training. Conversely, classes that fail to train show an initial reduction in percent excitatory activity (grey arrow). Across the successful and unsuccessful networks, successfully trained classes tend toward a lower activity level and a higher percent excitatory activity, which is not observed in the unsuccessful networks.

Evolution of E:I activity during training

The interplay of excitatory and inhibitory activity across Fashion-MNIST training classes was further analyzed for three representative trials of low, moderate, and high initial activity, all of which train to over 50% accuracy (Fig. 4). Although trials with even higher initial activity exist, those do not successfully train (see Figs. 1 and 3 for the full activity range). In the low initial activity networks, the first epoch was characterized by a large increase in activity (Fig. 4a). Moreover, this increase was disproportionately excitatory, as shown by the increased percent of excitatory activity in the first epoch of training. The moderate initial activity trials also displayed increased excitation initially, even though overall activity remained relatively constant (Fig. 4b). The remainder of training for the low and moderate activity trials showed a continuous, incremental increase in inhibitory activity while excitatory activity remained constant or slowly declined.
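The percent excitatory activity tracked here reduces to a simple ratio of spike counts over the two subpopulations. A minimal sketch, assuming the first `n_exc` hidden neurons are excitatory (an illustrative layout, not the paper's code):

```python
import numpy as np

def percent_excitatory(spike_counts, n_exc):
    """Percentage of total hidden-layer spikes produced by the excitatory
    subpopulation. `spike_counts` is a length-n_hidden array of spike counts
    (e.g. summed over all images of one class in one epoch), with the first
    n_exc entries belonging to excitatory neurons."""
    total = spike_counts.sum()
    if total == 0:
        return 0.0
    return 100.0 * spike_counts[:n_exc].sum() / total

# Example: a 50:50 network (100 hidden neurons, 50 excitatory) whose
# excitatory half emits 300 spikes and inhibitory half 100 spikes.
counts = np.concatenate([np.full(50, 6.0), np.full(50, 2.0)])
print(percent_excitatory(counts, n_exc=50))  # 75.0
```

Evaluating this per class and per epoch yields the third-row curves of Fig. 4; a silent network is reported as 0% here by convention.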

This trend was also consistent among classes within the dataset. Due to the conversion process from pixels to the spiking domain, objects that occupy larger portions of the 28×28 pixel space generate more network activity. For example, the pullover class (label 2) generated over 10 times the hidden-layer activity at the end of training compared to the sandal class (label 5). The ratio of excitatory to inhibitory activity of lower-activity classes tended to be higher than that of higher-activity classes.

The high initial activity example shows an initial reduction in the percent of excitatory activity (Fig. 4c). Accuracy also does not increase until near epoch 10, at which point the proportion of excitatory activity for some classes increases. In this case, the classes that increased in proportional activity were the lower-activity sandal and sneaker classes (labels 5 and 7, respectively). As training continued, similar to the other initial conditions, the excitatory activity decreased, but so did the inhibitory activity.

These general trends for low, moderate, and high initial activity are consistent across E:I ratios. Note that for higher initial E:I ratios, the initial boost in the percentage of excitatory activity was not visible, since the starting percentage was already above the levels reached by the increases observed in the 50:50 trials. The 50:50 trials were therefore shown here, as the other trends seen in the other E:I ratios also appear in the 50:50 example. Additional examples for the 80:20 and 95:5 ratios are provided in Supplemental Figs. S2 and S3. Overall, this analysis agrees with the previous section: networks train toward a low activity level (approximately one spike per neuron, Fig. 4) with an initial increase in the percent excitatory activity that then decreases across training. Here, only successful networks are analyzed and compared to identify trends of an individual network across training. In particular, specific classes of the dataset train differently in terms of excitatory and inhibitory activity, which should be noted for further use of the Fashion-MNIST dataset.

Fig. 4

Representative activity of three 50:50 Fashion-MNIST trials. The examples cover (a) low, (b) moderate, and (c) higher initial activity. The average number of excitatory (first row) and inhibitory (second row) spikes per image denotes the number of spikes generated by the hidden layer, on average, for any input image in the Fashion-MNIST dataset. Each line represents a different class, with the black line denoting the average across all classes. The percentage of excitatory spikes across training (third row) shows the initial increase in proportional excitatory activity, followed by a decrease for the remainder of training in the low and moderate initial activity conditions. The adjustment of activity from the network training perspective is derived from the weight distributions before (fourth row) and after (fifth row) training. The accuracy convergence curve of each individual trial is included as an inset of the post-training weights.

Distances between neuronal pairs

To further contrast successfully and unsuccessfully trained networks, the Van Rossum distance, a measure of dissimilarity between spike trains, was calculated between pairs of excitatory-excitatory (E-E), excitatory-inhibitory (E-I), and inhibitory-inhibitory (I-I) hidden layer neurons for the Fashion-MNIST and SHD datasets (Fig. 5). By taking the average distance between all pairs across the entire dataset, all trials can be compared across the three pair categories. Initially, the distribution of distances was the same regardless of the initial excitatory and inhibitory ratio or pair category. The post-training trials are again divided into successful and failed networks: Fashion-MNIST success was defined as peak accuracy greater than 50%, and SHD success as an average accuracy over the last 25 epochs greater than 30%.
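The Van Rossum distance filters each spike train with a causal exponential kernel and takes the L2 norm of the difference. A minimal sketch for binary-binned spike trains, with illustrative values of tau and dt (the paper's kernel settings are not stated here):

```python
import numpy as np

def van_rossum_distance(spikes_a, spikes_b, tau=20e-3, dt=1e-3):
    """Van Rossum distance between two spike trains given as 0/1 arrays per
    time bin: convolve each with a causal exponential kernel exp(-t/tau),
    then take the (tau-normalized) L2 norm of the difference."""
    n = len(spikes_a)
    decay = np.exp(-dt / tau)
    fa = np.zeros(n)
    fb = np.zeros(n)
    a = b = 0.0
    for t in range(n):
        a = a * decay + spikes_a[t]   # leaky trace of train A
        b = b * decay + spikes_b[t]   # leaky trace of train B
        fa[t], fb[t] = a, b
    return np.sqrt(np.sum((fa - fb) ** 2) * dt / tau)

# Identical trains have zero distance; shifted/different trains do not.
t1 = np.zeros(1000); t1[[100, 400, 700]] = 1
t2 = np.zeros(1000); t2[[120, 430, 900]] = 1
print(van_rossum_distance(t1, t1))      # 0.0
print(van_rossum_distance(t1, t2) > 0)  # True
```

Averaging this quantity over all E-E, E-I, and I-I neuron pairs and over the whole dataset yields one value per category per network, as plotted in Fig. 5.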

For the Fashion-MNIST distances, successful networks have a lower range of distances than failed networks, consistent with the overall activity level of the networks, since higher activity levels correspond to higher distances (Fig. 5a–c). Successful 80:20 networks had a median E-E distance of 0.890, while failed networks were significantly higher at a median E-E distance of 1.532. Additionally, we observed that the three distance categories in successful 50:50 networks were significantly different (p = 0.0091, Kruskal-Wallis, Supplemental Table T1). Interestingly, the higher E:I ratios of 80:20 and 95:5 displayed even greater disparity in distances between the three categories for successful networks (p < 0.0001, Kruskal-Wallis, Supplemental Table T1). The median Van Rossum distance between E-E neuron pairs was less than that of the I-I neuron pairs for all E:I ratios, with the 50:50 network showing the greatest difference of 39.6% between the E-E and I-I median distances. This indicates that the activity of inhibitory neurons is more differentiated than that of the excitatory neurons, paralleling the variation in inhibitory neuron types within subregions such as the hippocampus, which exhibit a range of subtypes and firing behaviors32,33.
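The category comparison above uses the Kruskal-Wallis test, a rank-based test for whether several samples come from the same distribution. A hand-rolled sketch of the H statistic on synthetic data (the distances below are fabricated illustrations, not the paper's measurements; ties are ignored, which is adequate for continuous distance data):

```python
import numpy as np

def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic for k independent samples, without tie
    correction. Compare against the chi-square critical value with k-1
    degrees of freedom to judge significance."""
    data = np.concatenate(groups)
    ranks = np.empty_like(data)
    ranks[np.argsort(data)] = np.arange(1, len(data) + 1)  # rank all values
    n_total = len(data)
    h = 0.0
    start = 0
    for g in groups:
        r = ranks[start:start + len(g)]     # ranks belonging to this group
        h += r.sum() ** 2 / len(g)
        start += len(g)
    return 12.0 / (n_total * (n_total + 1)) * h - 3.0 * (n_total + 1)

# Illustrative: mean pair distances for the E-E, E-I, I-I categories across
# 30 hypothetical trials, with E-E lowest and I-I highest as in the text.
rng = np.random.default_rng(2)
ee = rng.normal(0.9, 0.1, 30)
ei = rng.normal(1.1, 0.1, 30)
ii = rng.normal(1.3, 0.1, 30)
h = kruskal_wallis_h(ee, ei, ii)
print(h > 5.991)  # exceeds the chi-square critical value (df=2, alpha=0.05)
```

In practice `scipy.stats.kruskal` computes both H (with tie correction) and the p-value directly; the critical value 5.991 is the 95th percentile of the chi-square distribution with 2 degrees of freedom.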

The trend across pair categories for the SHD dataset matches that of the Fashion-MNIST dataset, with E-E distances being the lowest on average (Fig. 5d–f). Additionally, the I-I distances have a higher median than the E-E pairs (see Supplemental Table T2 for the Kruskal-Wallis statistical analysis). This further indicates the importance of differentiated inhibitory neural activity in successful networks. In contrast to Fashion-MNIST, the range of distances of successful networks after training was higher than the initial range, showing a greater increase in distances between the initial and final networks on the SHD dataset. Furthermore, the distances from successful SHD training were above 1.5, while Fashion-MNIST distances ranged below 1.5, indicating the increased complexity of the SHD dataset and the greater need for differentiated neurons to classify more complex data.

Fig. 5

Average Van Rossum distances between hidden layer neuron pairs trained on the Fashion-MNIST (a–c) and SHD (d–f) datasets. Rows correspond to different E:I ratios: 50:50 (a,d), 80:20 (b,e), and 95:5 (c,f). The average distances between neuron pairs are grouped into excitatory-excitatory (E-E), excitatory-inhibitory (E-I), and inhibitory-inhibitory (I-I) pairs, with the distance averaged across all cases. Each network tested in Fig. 1 is represented by a single average value for each of the three categories, which together make up the distributions in each subfigure. The initial networks span a range of distances based on their varied weight initializations but show equal distributions across the three categories. Successfully trained networks from the Fashion-MNIST trials, defined as networks with peak overall accuracy >50%, have a lower distribution of distances, while networks that fail to reach 50% accuracy typically lie in a higher range of activity, with the difference between failure and success being stronger at higher E:I ratios. Successfully trained networks from the SHD trials are defined as networks with average accuracy >30% over the last 25 epochs. Using the Kruskal-Wallis test, the three categories are statistically significantly different with p-values <0.0001 for the successful 80:20 and 95:5 networks, with I-I distances being the largest and E-E distances the lowest; * indicates p < 0.05, ** p < 0.01, and *** p < 0.001.


