Rhythm-SNN
The proposed Rhythm-SNN utilizes heterogeneous oscillatory signals to modulate the membrane potential update and spike generation. Since these two neuronal dynamics are fundamental to all spiking neuron models, our rhythmic modulation mechanism is applicable across a wide range of such models. Here, we employ the widely used LIF64,65,66 neuron model for illustration. Additional details on other recently developed network architectures incorporating our rhythmic modulation mechanism, along with their mathematical formulations, are provided in Supplementary Section 1. For the vanilla LIF neuron, the membrane potential of the ith neuron in layer l evolves according to:
$${\tau }_{m}\frac{\partial {U}_{i}^{l}}{\partial t}=-\left({U}_{i}^{l}-{U}_{r}\right)+{I}_{i}^{l},$$
(2)
where τm denotes the membrane time constant and Ur is the resting potential. \({U}_{i}^{l}\) and \({I}_{i}^{l}\) represent the membrane potential and input current of the neuron, respectively. Once \({U}_{i}^{l}\) exceeds the firing threshold θ, the neuron emits a spike \({S}_{i}^{l}\) and the firing threshold is subtracted from its membrane potential. The input current \({I}_{i}^{l}\) is computed by accumulating the spikes from all presynaptic neurons, yielding the following discrete form:
$${I}_{i}^{l}\left[t\right]={\sum}_{j}{w}_{ij}^{l}{S}_{j}^{l-1}\left[t\right]+{b}_{i}^{l},$$
(3)
where \({w}_{ij}^{l}\) represents the connection weight from the presynaptic neuron j in layer l − 1, \({b}_{i}^{l}\) denotes the constant injected current to neuron i, and \({S}_{j}^{l-1}\) signifies the input spike from the presynaptic neuron j.
By employing the zero-order hold (ZOH) method67, we obtain the discrete form of the membrane potential update from its continuous form in equation (2):
$${U}_{i}^{l}\left[t\right]=\alpha {U}_{i}^{l}\left[t-1\right]+{I}_{i}^{l}\left[t\right]-\theta {S}_{i}^{l}\left[t-1\right],$$
(4)
where α ≡ exp(−dt/τm) denotes a constant that captures the membrane potential decay, with τm as its time constant and dt as the simulation time step.
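The discrete update in equation (4) can be sketched in a few lines of Python (a minimal illustration with illustrative parameter values; the variable names and defaults are ours, not the authors' implementation):

```python
import math

def lif_step(u_prev, s_prev, i_t, tau_m=20.0, dt=1.0, theta=1.0):
    """One discrete LIF update following equation (4):
    U[t] = alpha * U[t-1] + I[t] - theta * S[t-1], spike when U[t] >= theta."""
    alpha = math.exp(-dt / tau_m)       # membrane decay factor
    u_t = alpha * u_prev + i_t - theta * s_prev
    s_t = 1.0 if u_t >= theta else 0.0  # Heaviside spike generation
    return u_t, s_t
```

Iterating this function over time steps, with `i_t` given by equation (3), reproduces the vanilla LIF dynamics.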
In contrast, our Rhythm-LIF neuron incorporates rhythmic modulation by utilizing oscillatory signals to modulate the membrane potential update and spike generation. Specifically, the membrane potential update is modulated by the introduced oscillatory signal \({m}_{i}^{l}\left[t\right]\) as follows:
$${U}_{i}^{l}[t]=\left\{\begin{array}{ll}\alpha {U}_{i}^{l}[t-1]+{I}_{i}^{l}[t]-\theta {S}_{i}^{l}[t-1],\quad &{{{\rm{if}}}}\quad {m}_{i}^{l}[t]=1\\ {U}_{i}^{l}[t-1],\quad \hfill&{{{\rm{if}}}}\quad {m}_{i}^{l}[t]=0\end{array}\right.,$$
(5)
where \({U}_{i}^{l}\left[t\right]\) is the membrane potential at time step t. Additionally, the corresponding firing activity is also modulated by the introduced oscillatory signal as described below:
$${S}_{i}^{l}\left[t\right]={m}_{i}^{l}\left[t\right]\, \Theta \left({U}_{i}^{l}\left[t\right]-\theta \right),$$
(6)
where
$${m}_{i}^{l}\left[t\right]=\left\{\begin{array}{ll}1,\quad &{{{\rm{if}}}}\quad 0\le \left(t-\lfloor {\varphi }_{i}^{l}{c}_{i}^{l}\rfloor \right)\,{{{\rm{mod}}}}\,{c}_{i}^{l} < \lfloor {d}_{i}^{l}{c}_{i}^{l}\rfloor \\ 0,\quad &{{{\rm{otherwise}}}}\hfill\end{array}\right..$$
(7)
Here, \({\varphi }_{i}^{l}\), \({c}_{i}^{l}\) and \({d}_{i}^{l}\) denote the initial phase, rhythm period, and duty cycle of the modulating signal, respectively; ⌊ ⋅ ⌋ represents the floor function; and Θ( ⋅ ) is the Heaviside step function, defined as Θ(x) = 1 for x ≥ 0 and Θ(x) = 0 for x < 0. Through this modulation mechanism, when \({m}_{i}^{l}\) equals zero, the neuron neither integrates input current from its presynaptic neurons nor emits spikes to its postsynaptic neurons, corresponding to an ‘inactive’ state. Conversely, when \({m}_{i}^{l}\) equals one, the neuron adheres to the original dynamics of a conventional spiking neuron, representing an ‘active’ state. The detailed computational graph of our proposed rhythmic spiking neuron is illustrated in Supplementary Fig. S1.
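Equations (5)–(7) can be illustrated with a short scalar sketch (illustrative parameter values and a fixed decay factor chosen by us; not the authors' implementation):

```python
import math

def rhythm_mask(t, phase, period, duty):
    """Oscillatory gating signal m[t] from equation (7): active when
    0 <= (t - floor(phase*period)) mod period < floor(duty*period)."""
    shifted = (t - math.floor(phase * period)) % period
    return 1 if shifted < math.floor(duty * period) else 0

def rhythm_lif_step(u_prev, s_prev, i_t, t, phase, period, duty,
                    alpha=0.95, theta=1.0):
    """One Rhythm-LIF update (equations (5)-(6)): both the membrane
    update and the spike generation are gated by m[t]."""
    m = rhythm_mask(t, phase, period, duty)
    if m == 1:
        u_t = alpha * u_prev + i_t - theta * s_prev  # normal LIF dynamics
    else:
        u_t = u_prev                                 # 'inactive': hold potential
    s_t = m * (1 if u_t >= theta else 0)             # spikes only when active
    return u_t, s_t
```

With `period=4` and `duty=0.5`, for example, the mask alternates between two active and two inactive steps, during which the membrane potential is simply held.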
Collectively, the neuronal dynamics of the LIF model and the proposed Rhythm-LIF model can be summarized as follows:
$$\begin{array}{cc}{{{\bf{LIF}}}}\,{{{\bf{model}}}}&{{{\bf{Rhythm}}}}-{{{\bf{LIF}}}}\,{{{\bf{model}}}}\\ \left\{\begin{array}{l}{I}_{i}^{l}[t]={\sum }_{j}{w}_{ij}^{l}{S}_{j}^{l-1}[t]+{b}_{i}^{l}\hfill \\ {U}_{i}^{l}[t]=\alpha {U}_{i}^{l}[t-1]+{I}_{i}^{l}[t]-\theta {S}_{i}^{l}[t-1]\\ {S}_{i}^{l}[t]=\Theta \left({U}_{i}^{l}[t]-\theta \right)\hfill \\ \end{array}\right.&{{{\rm{vs.}}}}\quad \left\{\begin{array}{l}{I}_{i}^{l}[t]={\sum }_{j}{w}_{ij}^{l}{S}_{j}^{l-1}[t]+{b}_{i}^{l}\hfill \\ {U}_{i}^{l}[t]=\left\{\begin{array}{ll}\alpha {U}_{i}^{l}[t-1]+{I}_{i}^{l}[t]-\theta {S}_{i}^{l}[t-1],\quad &{{{\rm{if}}}}\quad {m}_{i}^{l}[t]=1\\ {U}_{i}^{l}[t-1],\quad &{{{\rm{if}}}}\quad {m}_{i}^{l}[t]=0\end{array}\right.\\ {S}_{i}^{l}[t]={m}_{i}^{l}[t]\Theta \left({U}_{i}^{l}[t]-\theta \right)\hfill \\ {m}_{i}^{l}[t]=\left\{\begin{array}{l}1,\hfill \quad {{{\rm{if}}}}\quad 0\le (t-\lfloor {\varphi }_{i}^{l}{c}_{i}^{l}\rfloor )\,{{{\rm{mod}}}}\,{c}_{i}^{l} < \lfloor {d}_{i}^{l}{c}_{i}^{l}\rfloor \quad \\ 0,\quad{{{\rm{otherwise}}}}\hfill\end{array}\right.\hfill \end{array}\right.\,.\end{array}$$
(8)
Heterogeneous oscillation signals
To emulate the multiscale characteristics of neural oscillations, we parameterize the modulating signal \({m}_{i}^{l}\) by sampling its hyperparameters from diverse distributions. Specifically, given a Rhythm-SNN with L layers, the rhythmic parameters, i.e., the rhythm period \({c}_{i}^{l}\), duty cycle \({d}_{i}^{l}\), and phase \({\varphi }_{i}^{l}\) of the oscillatory signal \({m}_{i}^{l}\) of a neuron i in the layer l are generated through:
$$\left\{\begin{array}{c}{c}_{i}^{l} \sim {{{\mathcal{U}}}}\left({c}_{\min }^{l},{c}_{\max }^{l}\right)\\ {d}_{i}^{l} \sim {{{\mathcal{U}}}}\left({d}_{\min }^{l},{d}_{\max }^{l}\right)\\ {\varphi }_{i}^{l} \sim {{{\mathcal{U}}}}\left({\varphi }_{\min }^{l},{\varphi }_{\max }^{l}\right)\end{array}\right.\quad {{{\rm{with}}}}\quad \left\{\begin{array}{c}1\le {c}_{\min }^{l}\le T\\ 0 < {d}_{\min }^{l}\le {d}_{\max }^{l}\le 1\\ 0\le {\varphi }_{\min }^{l}\le {\varphi }_{\max }^{l}\le 1\end{array}\right.,$$
(9)
where T represents the total number of time steps and \({{{\mathcal{U}}}}\) denotes the uniform distribution. Here, we use the parameters \({c}_{\min }^{l}\), \({c}_{\max }^{l}\), \({d}_{\min }^{l}\), \({d}_{\max }^{l}\), \({\varphi }_{\min }^{l}\), and \({\varphi }_{\max }^{l}\) to define the range of the uniform distributions, which subsequently control the characteristics of the generated oscillatory signals. Since \({d}_{i}^{l}\) and \({\varphi }_{i}^{l}\) control the fraction of the duty cycle and the phase within a rhythm period, their values are constrained within the intervals (0,1] and [0,1], respectively. An ablation study on the impact of rhythm hyperparameters on Rhythm-SNNs’ performance is presented in Supplementary Figs. S13 and S14.
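The sampling scheme of equation (9) amounts to a few uniform draws per neuron. A minimal sketch, with illustrative ranges chosen by us (the actual ranges used per task are in the Supplementary ablation):

```python
import random

def sample_rhythm_params(n_neurons, c_range=(2, 8), d_range=(0.3, 0.7),
                         phi_range=(0.0, 1.0), seed=0):
    """Draw per-neuron rhythm period c, duty cycle d, and phase phi from
    uniform distributions, as in equation (9). Ranges are illustrative."""
    rng = random.Random(seed)
    params = []
    for _ in range(n_neurons):
        c = rng.uniform(*c_range)      # rhythm period, 1 <= c_min <= T
        d = rng.uniform(*d_range)      # duty cycle in (0, 1]
        phi = rng.uniform(*phi_range)  # initial phase in [0, 1]
        params.append((c, d, phi))
    return params
```

Drawing these hyperparameters independently per neuron is what produces the heterogeneous, multiscale gating signals across a layer.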
Training method for Rhythm-SNN
We use the backpropagation through time (BPTT) algorithm, combined with the surrogate gradient method68,69,70,71, to train the proposed Rhythm-SNN. During the training process, both the synaptic weights W and the constant injected current b are optimized. By applying the chain rule across both spatial and temporal dimensions, the derivatives of the loss function \({{{\mathcal{L}}}}\) with respect to the spike S can be formalized as follows:
$$\frac{\partial {{{\mathcal{L}}}}}{\partial {S}_{i}^{l}\left[t\right]} ={\sum}_{j}\frac{\partial {{{\mathcal{L}}}}}{\partial {S}_{j}^{l+1}\left[t\right]}\frac{\partial {S}_{j}^{l+1}\left[t\right]}{\partial {S}_{i}^{l}\left[t\right]}+\frac{\partial {{{\mathcal{L}}}}}{\partial {S}_{i}^{l}\left[t+1\right]}\frac{\partial {S}_{i}^{l}\left[t+1\right]}{\partial {S}_{i}^{l}\left[t\right]}\\ ={\sum}_{j}\frac{\partial {{{\mathcal{L}}}}}{\partial {S}_{j}^{l+1}\left[t\right]}\frac{\partial {S}_{j}^{l+1}\left[t\right]}{\partial {U}_{j}^{l+1}\left[t\right]}\frac{\partial {U}_{j}^{l+1}\left[t\right]}{\partial {S}_{i}^{l}\left[t\right]}+\frac{\partial {{{\mathcal{L}}}}}{\partial {S}_{i}^{l}\left[t+1\right]}\frac{\partial {S}_{i}^{l}\left[t+1\right]}{\partial {U}_{i}^{l}\left[t+1\right]}\frac{\partial {U}_{i}^{l}\left[t+1\right]}{\partial {S}_{i}^{l}\left[t\right]}\\ ={\sum}_{j}\frac{\partial {{{\mathcal{L}}}}}{\partial {S}_{j}^{l+1}\left[t\right]}\frac{\partial {S}_{j}^{l+1}\left[t\right]}{\partial {U}_{j}^{l+1}\left[t\right]}{m}_{j}^{l+1}\left[t\right]{w}_{ji}^{l+1}-\frac{\partial {{{\mathcal{L}}}}}{\partial {S}_{i}^{l}\left[t+1\right]}\frac{\partial {S}_{i}^{l}\left[t+1\right]}{\partial {U}_{i}^{l}\left[t+1\right]}\theta {m}_{i}^{l}\left[t+1\right].$$
(10)
Note that on the right-hand side of equation (10), the first term denotes the derivatives in the spatial dimension and the second term represents the derivatives in the temporal dimension. Similarly, the derivatives of the loss function with respect to the membrane potential U can be obtained by:
$$\frac{\partial {{{\mathcal{L}}}}}{\partial {U}_{i}^{l}\left[t\right]} =\frac{\partial {{{\mathcal{L}}}}}{\partial {S}_{i}^{l}\left[t\right]}\frac{\partial {S}_{i}^{l}\left[t\right]}{\partial {U}_{i}^{l}\left[t\right]}+\frac{\partial {{{\mathcal{L}}}}}{\partial {U}_{i}^{l}\left[t+1\right]}\frac{\partial {U}_{i}^{l}\left[t+1\right]}{\partial {U}_{i}^{l}\left[t\right]}\\ =\frac{\partial {{{\mathcal{L}}}}}{\partial {S}_{i}^{l}\left[t\right]}\frac{\partial {S}_{i}^{l}\left[t\right]}{\partial {U}_{i}^{l}\left[t\right]}+\frac{\partial {{{\mathcal{L}}}}}{\partial {U}_{i}^{l}\left[t+1\right]}\left(1-\left(1-\alpha \right){m}_{i}^{l}\left[t+1\right]\right).$$
(11)
By employing equations (10) and (11) iteratively backward in time, the derivatives \(\frac{\partial {{{\mathcal{L}}}}}{\partial {b}^{l}}\) and \(\frac{\partial {{{\mathcal{L}}}}}{\partial {W}^{l}}\) can be easily obtained as per:
$$\frac{\partial {{{\mathcal{L}}}}}{\partial {b}^{l}} ={\sum}_{t=1}^{T}\frac{\partial {{{\mathcal{L}}}}}{\partial {U}^{l}[t]}\frac{\partial {U}^{l}[t]}{\partial {b}^{l}}\\ ={\sum}_{t=1}^{T}\frac{\partial {{{\mathcal{L}}}}}{\partial {U}^{l}[t]}{m}^{l}\left[t\right],$$
(12)
$$\frac{\partial {{{\mathcal{L}}}}}{\partial {W}^{l}} ={\sum}_{t=1}^{T}\frac{\partial {{{\mathcal{L}}}}}{\partial {U}^{l}\left[t\right]}\frac{\partial {U}^{l}\left[t\right]}{\partial {W}^{l}}\\ ={\sum}_{t=1}^{T}\frac{\partial {{{\mathcal{L}}}}}{\partial {U}^{l}\left[t\right]}\frac{\partial {U}^{l}\left[t\right]}{\partial {I}^{l}\left[t\right]}\frac{\partial {I}^{l}\left[t\right]}{\partial {W}^{l}}\\ ={\sum}_{t=1}^{T}\frac{\partial {{{\mathcal{L}}}}}{\partial {U}^{l}\left[t\right]}{m}^{l}\left[t\right]{S}^{l-1}\left[t\right].$$
(13)
We use a rectangular surrogate function68,70 to approximate the non-differentiable spike activation function Θ( ⋅ ) during training, which is defined as follows:
$$\frac{\partial {S}_{i}^{l}}{\partial {U}_{i}^{l}}\approx h\left({U}_{i}^{l}\right)={{{\rm{sign}}}}\left(\left\vert {U}_{i}^{l}-\theta \right\vert < \frac{k}{2}\right),$$
(14)
where h( ⋅ ) represents the rectangular surrogate function and sign( ⋅ ) denotes the sign function. The hyperparameter k controls the width of the window through which gradients flow; following prior work46, we set k = 0.6.
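The rectangular surrogate of equation (14) reduces to a window indicator around the threshold; a minimal sketch (function name and defaults are ours):

```python
def rectangular_surrogate(u, theta=1.0, k=0.6):
    """Rectangular surrogate gradient of equation (14): the
    pseudo-derivative is 1 inside a window of width k centred on the
    firing threshold theta, and 0 outside it."""
    return 1.0 if abs(u - theta) < k / 2 else 0.0
```

During backpropagation this value replaces the ill-defined derivative of the Heaviside function Θ( ⋅ ), so gradients flow only for membrane potentials near the threshold.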
Mitigating the gradient vanishing problem with Rhythm-SNN
We next demonstrate that Rhythm-SNN can effectively address the gradient vanishing problem, a significant challenge faced by existing SNN models. To illustrate this, we first analyze how gradient information is backpropagated from an arbitrary time step \(t+{c}_{i}^{l}\) to time step t in the Rhythm-SNN, and compare it with a non-Rhythm-SNN that lacks the rhythmic modulation mechanism.
According to equation (11), the derivative of the loss with respect to the membrane potential at time step t in our Rhythm-SNN can be calculated via the following recursive formula:
$$\begin{array}{rcl} \frac{\partial {{{\mathcal{L}}}}}{\partial U_{i}^{l}\left[ t \right]} &=&\frac{\partial {{{\mathcal{L}}}}}{\partial S_{i}^{l}\left[ t \right]}\frac{\partial S_{i}^{l}\left[ t \right]}{\partial U_{i}^{l}\left[ t \right]}+\frac{\partial {{{\mathcal{L}}}}}{\partial U_{i}^{l}\left[ t+1 \right]}\left( 1-\left( 1-\alpha \right) m_{i}^{l}\left[ t+1 \right] \right) \\ &=&\frac{\partial {{{\mathcal{L}}}}}{\partial S_{i}^{l}\left[ t \right]}\frac{\partial S_{i}^{l}\left[ t \right]}{\partial U_{i}^{l}\left[ t \right]}+\sum\limits_{j=t+1}^{t+c_{i}^{l}-1}\frac{\partial {{{\mathcal{L}}}}}{\partial S_{i}^{l}\left[ j \right]}\frac{\partial S_{i}^{l}\left[ j \right]}{\partial U_{i}^{l}\left[ j \right]}\prod\limits_{k=t+1}^{j}\left( 1-\left( 1-\alpha \right) m_{i}^{l}\left[ k \right] \right)+\frac{\partial {{{\mathcal{L}}}}}{\partial U_{i}^{l}\left[ t+c_{i}^{l} \right]}\prod\limits_{j=t+1}^{t+c_{i}^{l}}\left( 1-\left( 1-\alpha \right) m_{i}^{l}\left[ j \right] \right)\\ &=&\underbrace{\sum\limits_{j=t}^{t+d_{i}^{l}c_{i}^{l}}\frac{\partial {{{\mathcal{L}}}}}{\partial S_{i}^{l}\left[ j \right]}\frac{\partial S_{i}^{l}\left[ j \right]}{\partial U_{i}^{l}\left[ j \right]}\alpha ^{j-t}+\sum\limits_{j=t+d_{i}^{l}c_{i}^{l}+1}^{t+c_{i}^{l}-1}\frac{\partial {{{\mathcal{L}}}}}{\partial S_{i}^{l}\left[ j \right]}\frac{\partial S_{i}^{l}\left[ j \right]}{\partial U_{i}^{l}\left[ j \right]}}_{{{\mathrm{gradient}}} \, {{\mathrm{propagated}}} \, {{\mathrm{before}}} \, {{\mathrm{time}}} \, t+c_{i}^{l}}+\underbrace{\alpha ^{d_{i}^{l}c_{i}^{l}}\frac{\partial {{{\mathcal{L}}}}}{\partial U_{i}^{l}\left[ t+c_{i}^{l} \right]}}_{{{\mathrm{gradient}}} \, {{\mathrm{propagated}}} \, {{\mathrm{from}}} \, {{\mathrm{time}}} \, t+c_{i}^{l}}\quad \left( 0 < \alpha,d_{i}^{l}\le 1\le c_{i}^{l} \right). \end{array}$$
(15)
Similarly, the derivative of the loss with respect to the membrane potential in the non-Rhythm-SNN is calculated as:
$$\begin{array}{rcl} \frac{\partial {{{\mathcal{L}}}}}{\partial U_{i}^{l}\left[ t \right]} &=&\frac{\partial {{{\mathcal{L}}}}}{\partial S_{i}^{l}\left[ t \right]}\frac{\partial S_{i}^{l}\left[ t \right]}{\partial U_{i}^{l}\left[ t \right]}+\frac{\partial {{{\mathcal{L}}}}}{\partial U_{i}^{l}\left[ t+1 \right]}\frac{\partial U_{i}^{l}\left[ t+1 \right]}{\partial U_{i}^{l}\left[ t \right]} \\ &=&\frac{\partial {{{\mathcal{L}}}}}{\partial S_{i}^{l}\left[ t \right]}\frac{\partial S_{i}^{l}\left[ t \right]}{\partial U_{i}^{l}\left[ t \right]}+\alpha \frac{\partial {{{\mathcal{L}}}}}{\partial U_{i}^{l}\left[ t+1 \right]} \\ &=&\underbrace{\sum\limits_{j=t}^{t+c_{i}^{l}-1}\frac{\partial {{{\mathcal{L}}}}}{\partial S_{i}^{l}\left[ j \right]}\frac{\partial S_{i}^{l}\left[ j \right]}{\partial U_{i}^{l}\left[ j \right]}\alpha ^{j-t}}_{{{\mathrm{gradient}}} \, {{\mathrm{propagated}}} \, {{\mathrm{before}}} \, {{\mathrm{time}}} \, t+c_{i}^{l}}+\underbrace{\alpha ^{c_{i}^{l}}\frac{\partial {{{\mathcal{L}}}}}{\partial U_{i}^{l}\left[ t+c_{i}^{l} \right]}}_{{{\mathrm{gradient}}} \, {{\mathrm{propagated}}} \, {{\mathrm{from}}} \, {{\mathrm{time}}} \, t+c_{i}^{l}}\quad \left( 0 < \alpha < 1\le c_{i}^{l} \right). \end{array}$$
(16)
Given 0 < α < 1, comparing the coefficients of the last terms in equations (15) and (16) shows that \({\alpha }^{{d}_{i}^{l}{c}_{i}^{l}}\ge {\alpha }^{{c}_{i}^{l}}\) since \({d}_{i}^{l}\le 1\); the duty cycle \({d}_{i}^{l}\) can therefore effectively mitigate the gradient vanishing issue during backpropagation. This property helps preserve information over a longer time span, thereby enhancing the ability to capture long-term dependencies.
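A quick numerical check of this comparison, with illustrative values of α, c, and d chosen by us:

```python
# Coefficient on the gradient arriving from time t + c (last terms of
# equations (15) and (16)), for illustrative values alpha=0.9, c=20, d=0.3.
alpha, c, d = 0.9, 20, 0.3

rhythm_coeff = alpha ** (d * c)      # Rhythm-SNN: alpha^(d*c)
vanilla_coeff = alpha ** c           # non-Rhythm-SNN: alpha^c

assert rhythm_coeff > vanilla_coeff  # duty cycle d < 1 slows the decay
```

Here the surviving gradient coefficient is roughly 0.53 for the Rhythm-SNN versus roughly 0.12 for the vanilla SNN, illustrating how a smaller duty cycle slows the exponential attenuation across a rhythm period.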
Analysis of the memory capacity of Rhythm-SNN
In this part, we demonstrate the enhanced memory capacity of Rhythm-SNNs over non-Rhythm-SNNs. First, we introduce a memory capacity metric used in non-spiking RNNs27, called the mean recurrent length, which captures the average distance between inputs and outputs of the network model over multiple timescales within a cyclic period. We then use this metric to compare the memory capacity of Rhythm-SNNs with that of non-Rhythm-SNNs (see “Proposition 1”).
Definition (mean recurrent length)
Consider the minimum path length \({{{{\mathcal{D}}}}}_{t}(n)\) from an input neuron at time t to an output neuron at time t + n. The minimum path length here refers to the shortest path length of a signal propagating across a time span of n and a network depth of L. For an SNN with cyclic period C, its mean recurrent length is defined as:
$$\bar{{{{\mathcal{D}}}}}=\frac{1}{C}{\sum}_{n=1}^{C}{\max }_{t}{{{{\mathcal{D}}}}}_{t}(n).$$
(17)
Proposition 1
Consider a Rhythm-SNN consisting of L layers, with rhythm periods of oscillatory signals ranging from c1 to ck, where c1 and ck are the minimum and maximum rhythm periods, respectively. The mean recurrent length of the Rhythm-SNN is less than that of the non-Rhythm-SNN.
Proof
For a Rhythm-SNN with rhythm periods ranging from c1 to ck, its cyclic period C can be calculated as follows:
$$C=\,{{\mbox{lcm}}}\,\left({c}_{1},\cdots \,,{c}_{k}\right),$$
(18)
where lcm signifies the least common multiple of c1, ⋯ , ck, and c1≤ ⋯ ≤ck.
If we unfold the information propagation paths from the input neuron to the output neuron through spatial and temporal dimensions, any path from the input neuron to the output neuron spanning n time steps yields:
$${{{{\mathcal{D}}}}}_{t}\left(n\right)=\left\{\begin{array}{ll}{r}_{t}\left(n\right)+L,\quad &\,{{\mbox{if}}}\,\quad n < C\\ \frac{C}{{c}_{k}}+L,\quad &\,{{\mbox{if}}}\,\quad n=C\end{array}\right.,$$
(19)
where \({r}_{t}\left(n\right)\) represents the shortest temporal path between the input neuron at time t and the output neuron at time t + n. Deriving \({r}_{t}\left(n\right)\) equates to solving the classic change-making problem: given coin denominations \(\left\{{c}_{1},\cdots \,,{c}_{k}\right\}\) and an amount n, minimize the number of coins that sum to n. Formally, \({r}_{t}\left(n\right)\) satisfies:
$${r}_{t}\left(n\right)=\min {\sum}_{j=1}^{k}{a}_{j},\quad \,\,\,{{\mbox{s}}}.{{\mbox{t}}}\,.{\sum}_{j=1}^{k}{a}_{j}{c}_{j}=n,$$
(20)
where aj denotes the number of coins of denomination cj. Following prior work27, we use a greedy strategy to obtain an upper bound for \({r}_{t}\left(n\right)\), thereby avoiding solving the integer linear program in equation (20) exactly. This yields:
$${r}_{t}\left(n\right)\le \frac{n}{{c}_{k}}.$$
(21)
According to equations (17), (19), and (21), the upper bound for the mean recurrent length of the Rhythm-SNN with L layers is obtained as:
$$\begin{array}{rcl}\bar{{{{\mathcal{D}}}}}&\le &\frac{1}{C}\left(\frac{1+\cdots+C}{{c}_{k}}+CL\right)\\ &=&\frac{C+1}{2{c}_{k}}+L.\end{array}$$
(22)
For an L-layered non-Rhythm-SNN, we have:
$${{{{\mathcal{D}}}}}_{t}\left(n\right)=n+L,\quad \forall n\le C.$$
(23)
Therefore, its mean recurrent length is:
$$\bar{{{{\mathcal{D}}}}} =\frac{1}{C}{\sum}_{n=1}^{C}n+L\\ =\frac{C+1}{2}+L.$$
(24)
Given that 1 < ck ≤ C, we have \(\frac{C+1}{2{c}_{k}}+L < \frac{C+1}{2}+L\), indicating that the mean recurrent length of the Rhythm-SNN is smaller than that of the conventional SNN. According to ref. 27, a shorter mean recurrent length implies a higher network memory capacity: past information propagates along fewer edges and thus experiences less attenuation. Consequently, the proposed modulation mechanism enhances the network’s memory capacity.
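The two bounds in equations (22) and (24) are easy to compare numerically; a minimal sketch with illustrative rhythm periods (function name and values are ours):

```python
from math import lcm

def mean_recurrent_length_bounds(periods, L):
    """Upper bounds on the mean recurrent length from equations (22)
    and (24). periods: rhythm periods c_1..c_k of a Rhythm-SNN with
    L layers; returns (Rhythm-SNN bound, non-Rhythm-SNN value)."""
    C = lcm(*periods)                        # cyclic period, equation (18)
    c_k = max(periods)
    rhythm_bound = (C + 1) / (2 * c_k) + L   # Rhythm-SNN, equation (22)
    vanilla = (C + 1) / 2 + L                # non-Rhythm-SNN, equation (24)
    return rhythm_bound, vanilla
```

For periods {2, 3, 4} and L = 3 layers, for instance, C = 12 and the Rhythm-SNN bound is well below the non-Rhythm value, consistent with Proposition 1.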
Robustness analysis
We analyze the robustness of Rhythm-SNNs to various perturbations by comparing the representation distance between output spike trains elicited by original patterns and those elicited by the corresponding perturbed patterns. Input perturbations mainly include adversarial attacks and random noise. Since the spiking Lipschitz constant28 provides a uniform bound on the network’s vulnerability to input perturbations regardless of noise type, the robustness of Rhythm-SNNs against these perturbations can be validated by analyzing the spiking Lipschitz constant that bounds the distance between output spike trains.
For a Rhythm-SNN with L layers, the output spike train of the lth layer can be represented as \({{{{\bf{S}}}}}^{l}=\left\{{{{{\bf{s}}}}}^{l}\left[t\right]| t=1,2,\cdots \,,T\right\}\in {\Omega }^{T\times {N}_{l}}\left(\Omega \in \{0,1\}\right)\), where T is the inference time step, and Nl is the number of spiking neurons in layer l. We quantify the distance between the original and perturbed activations using:
$${D}_{p}\left({{{{\bf{S}}}}}^{l},{\hat{{{{\bf{S}}}}}}^{l}\right)={\left\Vert {{{{\bf{S}}}}}^{l}-{\hat{{{{\bf{S}}}}}}^{l}\right\Vert }_{p}={\left({\sum}_{t=1}^{T}{\left\Vert {{{{\bf{s}}}}}^{l}\left[t\right]-{\hat{{{{\bf{s}}}}}}^{l}\left[t\right]\right\Vert }_{p}^{p}\right)}^{1/p},$$
(25)
where \({\hat{{{{\bf{S}}}}}}^{l}\) is the output spike train after perturbing the original input, and ∥ ⋅ ∥p is the matrix norm induced by the vector lp norm.
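Equation (25) can be computed directly for binary spike trains stored as nested lists; an illustrative helper (ours, not the authors' code):

```python
def spike_train_distance(S, S_hat, p=2):
    """Representation distance of equation (25) between two binary
    spike trains of shape T x N, given as nested lists of 0/1 values."""
    total = 0.0
    for s_t, sh_t in zip(S, S_hat):
        # accumulate the p-th power of the per-step lp distance
        total += sum(abs(a - b) ** p for a, b in zip(s_t, sh_t))
    return total ** (1 / p)
```

For binary spike trains with p = 2, this is simply the square root of the number of time-neuron positions at which the two trains differ.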
Previous studies72,73,74 have established the theoretical foundation for the vulnerability of neural networks to perturbations, primarily based on the magnitude of activation changes. Recent work28 has further extended this framework to spiking LIF models. We borrow this tool to analyze the distance bound of spike responses in Rhythm-SNNs and compare it with that in non-Rhythm-SNNs. According to prior work28, the upper bound of the distance between the original and perturbed spike trains for conventional SNNs can be described as:
$${D}_{2}{\left({{{{\bf{S}}}}}^{l},{\hat{{{{\bf{S}}}}}}^{l}\right)}^{2}\le \frac{1}{{\theta }^{2}}{{\varLambda }^{l}}^{2}{D}_{2}{\left({{{{\bf{S}}}}}^{l-1},{\hat{{{{\bf{S}}}}}}^{l-1}\right)}^{2}+{\varGamma }^{l},$$
(26)
with
$${\varLambda }^{l}={\sup }_{{\left\Vert {{{\bf{s}}}}\right\Vert }_{2}\le 1,{{{\bf{s}}}}\in {\varPhi }^{{N}_{l-1}}}{\left\Vert {{{{\bf{W}}}}}^{l}{{{\bf{s}}}}\right\Vert }_{2},$$
(27)
$${\varGamma }^{l}=\frac{{N}_{l}T\left(T+1\right)}{\alpha }\left[\frac{{\gamma }^{l}}{\theta }+{\left(\frac{{\gamma }^{l}}{\theta }\right)}^{2}\right],$$
(28)
where \({\gamma }^{l}={\sup }_{{{{\bf{s}}}}\ne 0,{{{\bf{s}}}}\in {\Omega }^{{N}_{l-1}}}{\left\Vert {{{{\bf{W}}}}}^{l}{{{\bf{s}}}}\right\Vert }_{\infty }+{\sup }_{{{{\bf{s}}}}\ne 0,{{{\bf{s}}}}\in {\Omega }^{{N}_{l-1}}}{\left\Vert -{{{{\bf{W}}}}}^{l}{{{\bf{s}}}}\right\Vert }_{\infty }\), Φ = {−1, 0, 1}, Ω = {0, 1}, θ represents the firing threshold, Wl is the weight matrix of layer l, and α signifies the decay factor of the membrane potential. In equation (27), Λl, i.e., the Lipschitz constant, mainly bounds the variation between the original and perturbed spike outputs. For the Rhythm-SNN, the corresponding spiking Lipschitz constant can be deduced from equations (3)–(7) (see Supplementary Section 5 for more details):
$${\tilde{\varLambda }}^{l}=\alpha {\sup }_{{\left\Vert {{{\bf{s}}}}\right\Vert }_{2}\le 1,{{{\bf{s}}}}\in {\varPhi }^{{N}_{l-1}}}{\left\Vert {{{{\bf{W}}}}}^{l}{{{\bf{s}}}}\right\Vert }_{2}.$$
(29)
Given that 0 < α < 1, the upper bound for the Rhythm-SNN’s spiking Lipschitz constant can be relaxed as follows:
$${\tilde{\varLambda }}^{l} =\alpha {\sup }_{{\left\Vert {{{\bf{s}}}}\right\Vert }_{2}\le 1,{{{\bf{s}}}}\in {\varPhi }^{{N}_{l-1}}}{\left\Vert {{{{\bf{W}}}}}^{l}{{{\bf{s}}}}\right\Vert }_{2}\\ < {\sup }_{{\left\Vert {{{\bf{s}}}}\right\Vert }_{2}\le 1,{{{\bf{s}}}}\in {\varPhi }^{{N}_{l-1}}}{\left\Vert {{{{\bf{W}}}}}^{l}{{{\bf{s}}}}\right\Vert }_{2}={\varLambda }^{l}.$$
(30)
The above comparison shows that our Rhythm-SNN possesses a smaller spiking Lipschitz constant than the conventional SNN. Since a smaller spiking Lipschitz constant generally leads to smaller perturbations of the network output28,72,73,74, this implies that Rhythm-SNNs are more robust to input perturbations.
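For intuition, the sup in equation (27) can be brute-forced for a tiny layer, since the constraint ||s||2 ≤ 1 with entries in {−1, 0, 1} only admits vectors with at most one nonzero entry. A sketch of this enumeration (our own illustration, feasible only for small N):

```python
import itertools
import math

def spiking_lipschitz(W, alpha=None):
    """Brute-force the sup in equation (27) over s in {-1,0,1}^N with
    ||s||_2 <= 1 (W given as a list of rows). If alpha is provided,
    return the Rhythm-SNN constant of equation (29), i.e. alpha times
    the sup; otherwise return the conventional constant Lambda."""
    n = len(W[0])
    best = 0.0
    for s in itertools.product((-1, 0, 1), repeat=n):
        if sum(x * x for x in s) > 1:  # enforce ||s||_2 <= 1
            continue
        Ws = [sum(row[j] * s[j] for j in range(n)) for row in W]
        best = max(best, math.sqrt(sum(v * v for v in Ws)))
    return best if alpha is None else alpha * best
```

Because only one entry of s can be nonzero, the sup reduces to the largest column 2-norm of Wl, and multiplying by any 0 < α < 1 strictly shrinks it, as in equation (30).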
Experimental setup for temporal processing tasks
We conduct experiments on several widely used temporal processing benchmarks, including S-MNIST29, PS-MNIST29, SHD30, ECG32, GSC31, VoxCeleb133, PTB34, and DVS-Gesture35 to validate the effectiveness of our method.
S-MNIST and PS-MNIST are built by performing a raster scan on the original MNIST digit recognition dataset in a pixel-by-pixel manner, resulting in sequences with a length of 784. Unlike S-MNIST, PS-MNIST applies a random permutation to the pixels of the original image before performing a raster scan, eliminating the original spatial structure. For both tasks, the pixel values are directly fed into the network as injected current to the neurons in the first layer. This layer functions as an encoding layer, converting non-spiking inputs into spiking outputs to enable further processing by SNNs.
The SHD dataset30 comprises approximately 10,000 audio recordings of spoken English and German digits (0–9) from 12 speakers. Each speaker recorded approximately 40 sequences for each digit in both languages, resulting in a total of 10,420 sequences. The recordings are transformed into spike-based representations using a bionic inner ear model. Following previous research36, the resulting spike trains are segmented into sequences of 1000 frames for post-processing by an SNN. The dataset is partitioned into 8156 samples for training and 2264 samples for testing.
The ECG dataset32 contains six types of ECG waveforms, i.e., P, PQ, QR, RS, ST, and TP. We adhere to the data preprocessing procedures outlined in prior work32. Specifically, we apply a variant of the level-crossing encoding method32 on the derivative of the normalized ECG signal to convert the original continuous values into a spike train. Each channel is transformed into two distinct spike trains, representing value-increasing events and value-decreasing events, respectively.
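A heavily simplified sketch of such up/down level-crossing encoding (our own simplification for illustration; the exact scheme of ref. 32 differs in its details):

```python
def level_crossing_encode(signal, delta=0.1):
    """Minimal sketch of level-crossing encoding: emit an UP spike each
    time the signal rises past the next level (a multiple of delta above
    the running reference), and a DOWN spike when it falls past one.
    Returns two binary spike trains for increasing/decreasing events."""
    up, down = [], []
    ref = signal[0]
    for x in signal[1:]:
        u = d = 0
        while x - ref >= delta:    # value-increasing events
            u, ref = 1, ref + delta
        while ref - x >= delta:    # value-decreasing events
            d, ref = 1, ref - delta
        up.append(u)
        down.append(d)
    return up, down
```

In this way, each continuous channel is converted into two event streams, one signalling increases and one signalling decreases, which mirrors the two spike trains per channel described above.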
The GSC dataset31 consists of 64,727 utterances from 1881 speakers, each pronouncing one of 35 distinct speech commands. In our experiments, we follow the dataset configuration commonly used in other works32,36, selecting 12 classes from the 35 available classes. These include ten specific commands: “Yes”, “No”, “Up”, “Down”, “Left”, “Right”, “On”, “Off”, “Stop”, and “Go”. Additionally, there are two extra classes: an “Unknown” class, which encompasses the remaining 25 commands, and a “Silence” class, created by randomly sampling background noise from the dataset’s audio files. For feature extraction, we follow the preprocessing approach described in prior work32. Specifically, the logarithm of 40 Mel filter coefficients is computed on a Mel scale ranging from 20 Hz to 4 kHz, together with their first two derivatives. Each frame of the processed input is thus represented as a tensor with dimensions 40 × 3, corresponding to the coefficients and their derivatives. The spectrograms are normalized to ensure an appropriate input scale, and each time step in the simulation is set to 10 ms. As a result, each audio sample is transformed into a sequence of 101 frames, with each frame containing 120 channels.
The VoxCeleb1 dataset33, sourced from YouTube, includes 153,516 utterances from 1251 celebrities with diverse ethnicities, accents, professions, and ages, with balanced speaker gender, resulting in a classification task with 1251 classes. All audio is first converted to single-channel, 16-bit streams at a 16 kHz sampling rate for consistency. Spectrograms are then generated in a sliding window fashion using a Hamming window of width 25 ms and stride 10 ms.
The PTB dataset34 contains 929,000 words for training, 73,000 for validation, and 82,000 for testing, with a vocabulary size of 10,000 words. The text is segmented into sequences of fixed length 200, where each sequence serves as input for models tasked with predicting the subsequent word. To represent the words, we employ an embedding dictionary of size 650, which encodes each word into a dense vector space, capturing both semantic and syntactic relationships.
The DVS-Gesture dataset35 comprises 11 types of hand and arm movements performed by 29 individuals, recorded under three different lighting conditions using a DVS128 camera. Each frame in the dataset is a 128 × 128 image with two channels. Each sample is divided into fixed-duration blocks, with each block averaged into a single frame, resulting in sequences that vary from 500 to 1500 frames depending on the block length.
The training configurations and hyperparameter settings for the above temporal processing tasks are summarized in Supplementary Table 1. All models are implemented with the PyTorch library and trained using the Adam optimizer. Our experiments are conducted on Nvidia GeForce RTX 3090 GPUs, each equipped with 24 GB of memory. In Table 1 of the main text, we provide experimental results for both Rhythm-SNNs and non-Rhythm-SNNs employing various spiking neuron models with both feedforward and recurrent architectures. Specifically, the tested models encompass the feedforward SNN (FFSNN)75, the SNN with recurrent connections (SRNN)70, the SRNN with a learnable firing threshold (LSNN)40, the SRNN with both a learnable firing threshold and a learnable time constant (ASRNN)43, and the SNN incorporating temporal dendritic heterogeneity (DH-SRNN and DH-SFNN)36. Their Rhythm-SNN counterparts are denoted as Rhythm-FFSNN, Rhythm-SRNN, Rhythm-LSNN, Rhythm-ASRNN, Rhythm-DH-SRNN, and Rhythm-DH-SFNN. The detailed mathematical formulations of these models are provided in Supplementary Section 1.
Experimental setup for the STORE-RECALL task
In this experiment, a 3-layer SRNN architecture is utilized, with each layer comprising 20 neurons. Two types of spiking neuron models are examined: ALIF40 and DEXAT45 neurons. For the ALIF and Rhythm-ALIF models, the membrane potential decay time constant and adaptive threshold time constant are set to 20 ms and 600 ms, respectively. For DEXAT and Rhythm-DEXAT, the membrane potential decay time constant and the two adaptive threshold time constants are set to 20 ms, 30 ms, and 600 ms, respectively. These time constant settings are consistent with prior work45, as they have been chosen based on the characteristics of these two models and the task requirements. The mathematical formulations of the proposed Rhythm-ALIF and Rhythm-DEXAT models are provided in Supplementary Section 2. Input signals, composed of characters ‘0’ and ‘1’ along with ‘STORE’ and ‘RECALL’ commands, are encoded into 50 Hz Poisson spike trains by four separate neuron groups. Each neuron group contains 25 neurons and encodes each character/command within a 100 ms time window. Each ‘STORE’ command is followed by a ‘RECALL’ command with a probability of p = 1/6, leading to an average delay of 600 ms between these two commands. The output layer uses a softmax activation function, and the resulting output vector is utilized to calculate the recall error and cross-entropy loss relative to the provided label. Following previous work44,45, the network is trained for 200 epochs or until the recall error on the validation set drops below 0.05. Detailed training configurations and hyperparameter settings are provided in Supplementary Table 2.
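The input encoding described above can be sketched as follows (the rate and window size follow the text; discretization and all other implementation details are our own assumptions):

```python
import random

def poisson_spike_train(rate_hz, duration_ms, n_neurons, dt_ms=1.0, seed=0):
    """Encode one character/command as a Poisson spike train over a time
    window, e.g. 50 Hz over 100 ms for a 25-neuron group as in the
    STORE-RECALL setup. Returns a steps x n_neurons binary matrix."""
    rng = random.Random(seed)
    p_spike = rate_hz * dt_ms / 1000.0   # spike probability per time bin
    steps = int(duration_ms / dt_ms)
    return [[1 if rng.random() < p_spike else 0 for _ in range(n_neurons)]
            for _ in range(steps)]
```

With a 1 ms bin, a 50 Hz rate corresponds to a spike probability of 0.05 per bin per neuron, so a 100 ms window over 25 neurons yields on the order of 125 spikes.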
Experimental setup for robustness evaluation tasks
We assess the robustness of Rhythm-SNNs against various perturbations, including four types of noise (Gaussian, thermal, silence, and quantization) and two types of adversarial attacks generated using FGSM and PGD. Gaussian noise is characterized by zero mean and variance ranging from (2/255)2 to (8/255)2. Thermal noise, which affects the input currents to spiking neurons, is simulated by adjusting variance levels from 0.05 to 0.2; silence noise, occurring when a subset of spiking neurons fails to respond, is simulated by randomly masking neuron outputs with failure rates ranging from 5% to 20%; quantization noise, resulting from the conversion of analog signals into digital signals with limited bit resolution, is simulated through post-training quantization, progressively reducing the bit number from 8 down to 2. For gradient-based attacks, FGSM perturbs the input data in the direction of the gradient of the loss with respect to the input, while PGD operates as an iterative and more potent version of FGSM. Our evaluation is anchored on the temporal processing task employing the PS-MNIST dataset. We conduct experiments with the ASRNN and Rhythm-ASRNN models at various noise and attack levels, comprehensively evaluating their robustness against perturbations. For simplicity, we denote (ϵ/255)2 as the variance for Gaussian noise, σ as the variance for thermal noise, p as the masking rate for silence noise, and Bit as the bit resolution for quantization noise. Visual comparisons of the average perturbation distance with respect to the average firing rate changes, as displayed in Fig. 4i–l, are conducted under conditions with ϵ = 8 for Gaussian noise, σ = 0.2 for thermal noise, p = 0.2 for silence noise, and Bit = 6 for quantization noise. More details of the experimental setup and perturbation methods are provided in Supplementary Section 3.
Experimental setup for the speech enhancement task
In this task, the Intel N-DNS Challenge dataset is utilized, which includes 500 h of human speech in various languages and noise types, recorded at 16 kHz and 16-bit depth, with a synthesized signal-to-noise ratio (SNR) ranging from 20 dB to −5 dB. For performance metrics, we use the Scale-Invariant Source-to-Noise Ratio (SI-SNR)53 to assess audio quality and DNSMOS54 for perceptual evaluation, with the latter considering the overall audio quality (OVR)54, speech signal quality (SIG)54, and background noise quality (BAK)54. We also evaluate the energy cost of the tested speech enhancement models. The architecture of the Rhythm-GSNN employed in this task consists of a full-band module and three sub-band modules, each of which contains two layers of the Rhythm-GSN model (see Supplementary Fig. S9). Specifically, the noisy audio is divided into three frequency bands after undergoing STFT and normalization, with each band containing an increasing number of frequencies: 32, 96, and 128, respectively. The audio is then fed into the full-band module, and its output features corresponding to each frequency band are processed by their respective sub-band modules. Finally, the features are integrated across the low-frequency, mid-frequency, and high-frequency bands, and the final denoised audio is obtained through iSTFT. More details of the training configurations and hyperparameter settings for the speech enhancement task are provided in Supplementary Table 4 and Supplementary Section 4.
