Hopf physical reservoir computer for reconfigurable sound recognition

This Hopf physical reservoir computer architecture is proposed for real-world edge computing applications, such as audio recognition. Although speech recognition is a relatively simple task for deep neural networks running on the cloud, it is a difficult task for edge computers due to their limited computational power. The proposed architecture effectively uses the strengths of both analog and digital devices by splicing an analog oscillator to a digital neural network. Moreover, the Hopf oscillator can be readily fabricated from commercial off the shelf electrical components.

The Hopf physical reservoir computer architecture discussed in this paper has several distinct differences from other similar physical reservoir computers. Most prominently, this Hopf oscillator is paired with a neural network rather than using a simple ridge regression. By increasing the complexity of the neural network, the Hopf physical reservoir computer is able to perform more difficult tasks. As the neural network is straightforward, it can be easily implemented. The architecture employed in this paper does not use any preprocessing of the original audio data, which significantly reduces the computational costs of the recognition task. Instead, it follows the activation signal to construct the feature maps by matrix reshape and inverse tanh. Usually, the Mel spectrum is used for this type of task, which can account for more than half of the computational load³³. Most nonlinear oscillator-based physical reservoir computers must use time-delayed feedback, which is cumbersome as it would require digital-to-analog and analog-to-digital converters. However, the Hopf oscillator is capable of storing enough information in its dynamic states to avoid this^24,25. Moreover, the presented architecture is robust to noise because of the Hopf oscillator’s nonlinearity, which is important for real-world audio processing applications.

The proposed architecture has several key advantages. First, the computational load for the proposed approach is significantly reduced. The computations involved in the construction of the feature maps are matrix reshape, normalization, and inverse tanh. These operations only consume around 10% of the computational power compared to the Mel spectrogram for a sampling rate of 4,000 Hz. An estimate of the computational load draws the conclusion that similar operations on Cortex-M4 (Arm, San Jose, California) edge devices yields only about 5 ms of the latency running this algorithm. Second, the proposed method can be paired with different machine learning models. Though the paper uses the CNN as the machine learning readout, the feature map yielded from the proposed method can be replaced by common image processing methods, including but not limited to transformer(³⁴), structure similarity index(³⁵), feedforward neural network(³⁶), and Euclidean distance(³⁷), etc. Third, compared to the Mel spectrogram, physically implemented limit cycles can generate features that are robust to both noise and low audio quality. It is worth noting that the audio used for the experiments is a downsampled version, which is about half of the sampling rate used by the Mel + CNN approach, while still achieving an audio recognition accuracy that is approximately 10% higher. As an example of this robustness, the feature map generated from the audio with additional noise (Fig. 5) retains its distinctive features even under extremely low signal-to-noise ratio (< 20).

Summary of results

In this paper, we present the results of sound signal recognition using reservoir computing technology consisting of a Hopf oscillator^24,25. Instead of employing computationally expensive preprocessing (e.g., Mel spectrum) commonly used in other studies^15,17,20,30, we directly take the outputs from the Hopf circuit to process the normalized audio signal for machine learning recognition. We anticipate that this Hopf reservoir computing can be directly implemented to microphones to achieve a future processing-on-the-sensor.

In “Results” Section, we systematically demonstrate that our Hopf reservoir computing approach yields a 10% accuracy improvement on a diverse 10-class urban sound recognition compared to the state-of-the-art results using edge devices³⁰, whereas we use a surprisingly simple preprocessing by just normalizing the original signal. The wake words recognition results in > 99% accuracy using the exact readout machine learning algorithm by only retraining the MLP. This implies that the Hopf reservoir computer will enable inference and reconfiguration on the edge for the sound recognition system. Additionally, compared to other reservoir computing systems (e.g.,^15,16,17,22), the spoken digit dataset yields superior performance without the need of using complex preprocessing, multiple physical devices, or mask functions; in addition, we have also conducted our benchmarking experiments on far more realistic datasets (i.e., the 10-class urban sound recognition dataset and the 4-class wake words dataset). We demonstrate boosted performance of audio signal processing by changing the activation signal strength of the Hopf oscillator, which implies that there are more degrees of freedom for reconfiguring physical reservoir computers as compared to other reservoir implementations.

Lastly, we carefully crafted the algorithms and preprocessing of the data for sound recognition tasks to keep overall energy consumption, including the digital readout, less than 1 mW based on FLOPS operations and the analog sampling rate. The computational load, which uses less than 700 sound clips of a 10-class dataset for training machine learning models, is well below the envelope of the computational resources possessed by consumer electronic devices. As such, the sound recognition devices using a Hopf reservoir computer could have an effortless integration with devices with untraceable computational load increases.

Analysis on the physical mechanisms of the Hopf reservoir computer sound recognition

Three elements play important roles in the audio signal recognition. The limit cycle system creates an oscillation signal in the temporal domain with a sinusoidal form, which continuously convolves with the incoming audio signal. This convolution is reminiscent of the Fourier transform, and the Hopf oscillator generates unique patterns for audio recognition (e.g., Fig. 2). Interestingly, this process largely replicates the process of the cochlea in extracting the sound signal features perceptible by the neurons. The nonlinear oscillation of the Hopf oscillator in the temporal direction creates nodal connections of the reservoir computer, corresponding to the neuron connections in DNN. Additionally, the nonlinearity of the Hopf oscillator causes it to respond differently to signals possessing various characteristic features of the audio in a broadband fashion, which produces clean separation of features (Figs. 2 and 7a).It is worth noting that some recent studies^38,39 have demonstrated that the cochlea and its directly-connected neurons create a limit cycle system using the previous audio signals as activation to dynamically enhance the performance of the cochlea in performing audio signal feature extraction. The physical model of the inner ear can be modeled as a Hopf oscillator with a time-delayed feedback loop using the signals from previous time instants to activate the limit cycle oscillations. The audio signal recognition actually happens in the inner ear instead of in the brain. An interesting future extension of this work is to explore different activation signals to create an artificial ear, which is capable of on-membrane audio recognition. In the meantime, the two states of the Hopf oscillator affect each other with a time delay, which enhances the memory effects essential to the time series signal processing.

Discussion and future work

The unique advantages of the Hopf reservoir computer demonstrated in this paper pave the way for the next generation of smart IoT devices that exploit the unused computational power in sensor networks. Specifically, the physical mechanisms backing reservoir computing also happen in the microphone membrane with carefully crafted activation signals³⁸. One could imagine that future microphones directly operate sound signal recognition using sensor mechanisms instead of dedicated processing rigs. In addition, as shown in Fig. 2, the feature map of sound signals consists of unique patterns that are recognized by a convolutional neural network commonly used for visual signal processing. An extension of the present work will explore the correlations of audio signal feature maps, visual signal feature maps, and other types of time series data features. As such, reservoir computing could be used as a backbone for multi-modal machine learning in smart IoT paradigms, including sensor fusion, audio video signal combination, and decentralized machine learning. The extremely small amount of training data required for the machine learning operation and clear feature separation described in “Results” Section could offer surprisingly satisfactory results, which is essential for many use cases without the luxury of unlimited sizes of datasets (e.g., soft user identification) or with noisy environments (e.g., a mix of different signals). One example is shown in Fig. 10: a eight-second long audio signal consisting of multiple different (i.e., car horn, drilling, and siren) is used to demonstrate the proof-of-concept of Hopf reservoir computer on mixed signal processing. The first four seconds of the audio clip only have car horn and drilling sound. For the last four seconds, the siren sound is added with a higher amplitude. As shown in the figure, the audio features generated from the Hopf reservoir computer has a clearly dominant class on the second half of the data and exhibits visually high correlation with the audio features generated by a clean siren sound with the same Hopf reservoir computer (an Euclidean distance less than 8). We anticipate a pattern matching algorithm originating from computer vision applications could be employed in this type of audio event separation and processing.

The implementation of this convolutional neural network adopts the same machine learning approach proposed by³⁰. Using the same urban sound recognition task, this allows a direct comparison of the features extracted from the physical reservoir computer as well as the spectrogram technique that is normally applied. Using the same machine learning readout but without computationally expensive preprocessing of the audio, the physical reservoir computing architecture employed in this paper achieved a 10% accuracy improvement compared to³⁰. In realistic applications for the Internet of Things, this machine learning method can be applied using dedicated neural processors, such as the Syntiant ND101. This particular chip could deploy approximately 60,000 neural cores, well above the requirement of the machine learning model used in the paper (\(\sim\)40,000 neural cores). As an alternative approach, the features generated from the reservoir computer could be further engineered to compress the amount of the data for audio recognition, such that the models can be deployed on low-level edge processors.

There are still limits in the reservoir computing method using the Hopf oscillator in its current form. First, the high accuracy sound event recognition requires many virtual nodes to generate diverse features for machine perception. However, increasing the virtual nodes leads to exponential growth of the sampling rate to read high quality audio data. We are actively seeking solutions to separate audio features from the original signal for recognition and recording, which could decrease the required sampling rate. Second, the current circuit-based physical reservoir separates the process of signal mixing and activation of the circuit. Redesigning the circuit is necessary to simplify signal reading for future system deployment. However, the ultimate version of the Hopf reservoir using MEMS will solve this problem, since the computing will happen on the audio sensing mechanisms. Lastly, the signal processing still relies on a digital readout. Though the algorithm is remarkably simple, a microcontroller unit is needed. We anticipate that the short-term solution will be deploying the optimized machine learning model as firmware (consuming less than 1 MB size of static memory without optimization and less than 256 KB dynamic memory for training upgraded machine learning models). A future goal should be using an analog circuit that could detect the spike signals for audio recognition (similar to neurons) to achieve a fully analog computer on edge devices⁴⁰.

Source link