Want to know what attracts me to soundscape analysis?
It is a field that combines science, creativity, and exploration in a way few others do. First, your lab is wherever your feet take you: forest trails, city parks, or remote mountain paths can all become spaces for scientific discovery and acoustic research. Second, monitoring selected geographical areas is all about creativity. Innovation is at the heart of environmental audio research, whether you rig custom devices, hide sensors in the tree canopy, or use solar power for off-grid setups. Finally, the sheer amount of data is amazing, and, as we know, in spatial analysis all methods are fair game. From hours of animal calls to the subtle hum of urban machinery, the collected acoustic data can be vast and complicated, opening the door to using everything from deep learning to geographical information systems (GIS) to make sense of it.
After my previous adventure with a soundscape analysis of a Polish river, I decided to raise the bar and design and implement a solution that allows me to analyze soundscapes in real time. In this blog post, I explain the proposed method and share the code that drives the entire process, which mainly relies on an Audio Spectrogram Transformer (AST) for sound classification.

Method
Setup
In this particular case, there were many reasons why I chose the Raspberry Pi 4 and AudioMoth combination. Trust me, I have tested a wide range of devices, from power-hungry models of the Raspberry Pi family to various Arduino versions, including the Portenta, to the Jetson Nano. And that was just the beginning; choosing the right microphone proved to be even more complicated.
In the end, I went with the Raspberry Pi 4 B (4 GB RAM) due to its solid performance and relatively low power consumption (~700 mA while running the code). Plus, pairing it with the AudioMoth in USB microphone mode gave me a lot of flexibility during prototyping. The AudioMoth is a powerful device with a wealth of configuration options, for example a sampling rate ranging from 8 kHz up to a stunning 384 kHz. I strongly feel that, in the long run, it will prove to be the best choice for my soundscape research.

Capture the sound
Capturing audio from a USB microphone using Python proved to be surprisingly troublesome. After a while of trying various libraries and struggling with them, I decided to go back to the good old Linux arecord. The entire sound capture mechanism is encapsulated in the following command:
arecord -d 1 -D plughw:0,7 -f S16_LE -r 16000 -c 1 -q /tmp/audio.wav
I intentionally use the plughw device to allow automatic conversion of the USB microphone's output. The AST expects 16 kHz samples, so both the recording and the AudioMoth's sampling rate are set to this value.
Note the generator in the code below. It is important that the device continuously captures audio at the specified time intervals. I aimed to store only the latest audio sample on the device and discard it right after classification. This approach helps protect people's privacy in large-scale urban studies and keeps the setup GDPR-compliant.
import asyncio
import re
import subprocess
from tempfile import TemporaryDirectory
from typing import Any, AsyncGenerator

import librosa
import numpy as np


class AudioDevice:
    def __init__(
        self,
        name: str,
        channels: int,
        sampling_rate: int,
        format: str,
    ):
        self.name = self._match_device(name)
        self.channels = channels
        self.sampling_rate = sampling_rate
        self.format = format

    @staticmethod
    def _match_device(name: str):
        # Find the ALSA card/device pair whose description contains `name`.
        lines = subprocess.check_output(['arecord', '-l'], text=True).splitlines()
        devices = [
            f'plughw:{m.group(1)},{m.group(2)}'
            for line in lines
            if name.lower() in line.lower()
            if (m := re.search(r'card (\d+):.*device (\d+):', line))
        ]
        if len(devices) == 0:
            raise ValueError(f'No devices found matching `{name}`')
        if len(devices) > 1:
            raise ValueError(f'Multiple devices found matching `{name}` -> {devices}')
        return devices[0]

    async def continuous_capture(
        self,
        sample_duration: int = 1,
        capture_delay: int = 0,
    ) -> AsyncGenerator[np.ndarray, Any]:
        with TemporaryDirectory() as temp_dir:
            # Only the latest sample ever lives on disk; it is overwritten on each iteration.
            temp_file = f'{temp_dir}/audio.wav'
            command = (
                f'arecord '
                f'-d {sample_duration} '
                f'-D {self.name} '
                f'-f {self.format} '
                f'-r {self.sampling_rate} '
                f'-c {self.channels} '
                f'-q '
                f'{temp_file}'
            )
            while True:
                subprocess.check_call(command, shell=True)
                data, sr = librosa.load(temp_file, sr=self.sampling_rate)
                await asyncio.sleep(capture_delay)
                yield data
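A minimal usage sketch of the capture loop follows. The device name 'AudioMoth' is an assumption (it is matched against the output of arecord -l), and the loop body is just a placeholder for the classification and SPL steps described below.

import asyncio


async def main():
    device = AudioDevice(
        name='AudioMoth',      # substring matched against `arecord -l` output (assumption)
        channels=1,
        sampling_rate=16000,   # the AST expects 16 kHz input
        format='S16_LE',
    )
    # Runs indefinitely; each iteration yields one freshly recorded sample.
    async for audio_sample in device.continuous_capture(sample_duration=1):
        print(f'captured {audio_sample.shape[0]} samples')  # replace with classification / SPL


asyncio.run(main())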
Classification
Now for the most exciting part.
With an Audio Spectrogram Transformer (AST) and the excellent Hugging Face ecosystem, you can efficiently analyze audio and classify the detected sounds into over 500 categories.
Note that the system is prepared to support a variety of pre-trained models. By default, it uses MIT/ast-finetuned-audioset-10-10-0.4593 because it gives the best results and runs well on the Raspberry Pi 4. However, onnx-community/ast-finetuned-audioset-10-10-0.4593-ONNX is also worth exploring; its quantized version requires less memory and returns inference results faster.
You may notice that I do not restrict the model to a single classification label, and that is intentional. Instead of assuming there is only one sound source at any given time, I apply a sigmoid function to the model logits to obtain independent probabilities for each class. This lets the model express confidence in multiple labels at the same time, which matters for real-world soundscapes where overlapping sources, such as birds, wind, and distant traffic, often occur together. Taking the top 5 results ensures that the system captures the most likely sound events in a sample without forcing a winner-takes-all decision.
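To illustrate this design choice with made-up logits (the values are purely for demonstration): softmax forces the classes to compete for a single unit of probability mass, while sigmoid scores each class independently, so several sources can be likely at once.

import torch

logits = torch.tensor([2.0, 1.8, -3.0])  # e.g. bird, wind, speech -- illustrative values only
print(torch.softmax(logits, dim=0))  # ~[0.55, 0.45, 0.00] -> forced to sum to 1, classes compete
print(torch.sigmoid(logits))         # ~[0.88, 0.86, 0.05] -> independent per-class confidence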
from pathlib import Path
from typing import Optional

import numpy as np
import pandas as pd
import torch
from optimum.onnxruntime import ORTModelForAudioClassification
from transformers import AutoFeatureExtractor, ASTForAudioClassification


class AudioClassifier:
    def __init__(self, pretrained_ast: str, pretrained_ast_file_name: Optional[str] = None):
        if pretrained_ast_file_name and Path(pretrained_ast_file_name).suffix == '.onnx':
            # Load an ONNX export through the ONNX Runtime wrapper from `optimum`.
            self.model = ORTModelForAudioClassification.from_pretrained(
                pretrained_ast,
                subfolder='onnx',
                file_name=pretrained_ast_file_name,
            )
            self.feature_extractor = AutoFeatureExtractor.from_pretrained(
                pretrained_ast,
                file_name=pretrained_ast_file_name,
            )
        else:
            # Regular PyTorch checkpoint.
            self.model = ASTForAudioClassification.from_pretrained(pretrained_ast)
            self.feature_extractor = AutoFeatureExtractor.from_pretrained(pretrained_ast)

        self.sampling_rate = self.feature_extractor.sampling_rate

    async def predict(
        self,
        audio: np.array,
        top_k: int = 5,
    ) -> pd.DataFrame:
        with torch.no_grad():
            # Turn the raw waveform into the spectrogram features the AST expects.
            inputs = self.feature_extractor(
                audio,
                sampling_rate=self.sampling_rate,
                return_tensors='pt',
            )
            logits = self.model(**inputs).logits[0]
            # Sigmoid yields independent per-class probabilities (multi-label setting).
            proba = torch.sigmoid(logits)
            top_k_indices = torch.argsort(proba)[-top_k:].flip(dims=(0,)).tolist()
            return pd.DataFrame(
                {
                    'label': [self.model.config.id2label[i] for i in top_k_indices],
                    'score': proba[top_k_indices],
                }
            )
To run the ONNX version of the model, you need to add optimum to the dependencies.
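For completeness, here is roughly how the two variants could be instantiated. The quantized ONNX file name below is an assumption, so check the onnx/ folder of the repository for the exact file names; audio_sample stands for a waveform yielded by continuous_capture.

# Default PyTorch checkpoint -- the configuration that gave me the best results.
classifier = AudioClassifier(pretrained_ast='MIT/ast-finetuned-audioset-10-10-0.4593')

# ONNX variant; 'model_quantized.onnx' is an assumed file name -- verify it in the repo.
onnx_classifier = AudioClassifier(
    pretrained_ast='onnx-community/ast-finetuned-audioset-10-10-0.4593-ONNX',
    pretrained_ast_file_name='model_quantized.onnx',
)

# Inside an async context, e.g. the capture loop:
# results = await classifier.predict(audio_sample, top_k=5)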
Sound pressure level
In addition to audio classification, the device captures information about the sound pressure level. This goes beyond identifying what produced a sound; it also gives insight into how strong each sound was. That way, the system captures a richer, more realistic representation of the acoustic scene and can ultimately describe noise pollution at a finer granularity.
import numpy as np
from maad.spl import wav2dBSPL
from maad.util import mean_dB


async def calculate_sound_pressure_level(audio: np.ndarray, gain=10 + 15, sensitivity=-18) -> np.ndarray:
    # Convert the waveform to sound pressure level (dB SPL) and average it over the sample.
    x = wav2dBSPL(audio, gain=gain, sensitivity=sensitivity, Vadc=1.25)
    return mean_dB(x, axis=0)
The gain (preamp + amp), sensitivity (dB/V), and Vadc (V) values are configured specifically for the AudioMoth and were verified experimentally. If you are using a different device, refer to its technical specifications to determine these values.
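A quick usage sketch, inside the same async loop that yields audio samples (audio_sample stands for one captured waveform; the gain and sensitivity values are the ones from my AudioMoth configuration, not universal defaults):

# gain = preamp + amp (dB), sensitivity in dB/V -- values verified for my AudioMoth setup.
spl = await calculate_sound_pressure_level(audio_sample, gain=10 + 15, sensitivity=-18)
print(f'Mean sound pressure level: {float(spl):.1f} dB SPL')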
Storage
Data from each sensor is synchronized with a PostgreSQL database every 30 seconds. The current prototype for urban soundscape monitoring uses an Ethernet connection, so I am not constrained by network load. Devices in more remote areas use GSM connections and synchronize data every hour. Below is a sample of the stored classification records:
label            score        device  sync_id                               sync_time
Hum              0.43894055   yor     9531b89a-4b38-4a43-946b-43ae2f704961  2025-05-26 14:57:49.104271
Mains hum        0.3894045    yor     9531b89a-4b38-4a43-946b-43ae2f704961  2025-05-26 14:57:49.104271
Static           0.06389702   yor     9531b89a-4b38-4a43-946b-43ae2f704961  2025-05-26 14:57:49.104271
Buzz             0.047603738  yor     9531b89a-4b38-4a43-946b-43ae2f704961  2025-05-26 14:57:49.104271
White noise      0.03204195   yor     9531b89a-4b38-4a43-946b-43ae2f704961  2025-05-26 14:57:49.104271
Bee, wasp, etc.  0.40881288   yor     8477e05c-0b52-41b2-b5e9-727a01b9ec87  2025-05-26 14:58:40.641071
Fly, housefly    0.38868183   yor     8477e05c-0b52-41b2-b5e9-727a01b9ec87  2025-05-26 14:58:40.641071
Insect           0.35616025   yor     8477e05c-0b52-41b2-b5e9-727a01b9ec87  2025-05-26 14:58:40.641071
Speech           0.23579548   yor     8477e05c-0b52-41b2-b5e9-727a01b9ec87  2025-05-26 14:58:40.641071
Buzz             0.105577625  yor     8477e05c-0b52-41b2-b5e9-727a01b9ec87  2025-05-26 14:58:40.641071
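The synchronization code itself is not shown in this post, but a minimal sketch of what one sync could look like follows. The observations table, its columns, and the psycopg2-based approach are assumptions for illustration, not the actual deployment code.

import uuid
from datetime import datetime, timezone

import psycopg2


def sync_results(conn_string: str, device: str, results) -> None:
    # `results` is the label/score DataFrame returned by AudioClassifier.predict (assumption).
    sync_id = str(uuid.uuid4())
    sync_time = datetime.now(timezone.utc)
    rows = [
        (row.label, float(row.score), device, sync_id, sync_time)
        for row in results.itertuples()
    ]
    # Connection handling is simplified; the context manager commits on success.
    with psycopg2.connect(conn_string) as conn, conn.cursor() as cur:
        cur.executemany(
            'INSERT INTO observations (label, score, device, sync_id, sync_time) '
            'VALUES (%s, %s, %s, %s, %s)',
            rows,
        )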
Results
Another application, built with Streamlit and Plotly, accesses this data. It currently displays information about the device locations, temporal SPL (sound pressure level), identified sound classes, and various acoustic indices.
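The dashboard code is beyond the scope of this post, but a minimal sketch of one of its views could look like the snippet below. The connection string and the observations table follow the assumptions from the storage sketch above; the actual dashboard is more elaborate.

import pandas as pd
import plotly.express as px
import streamlit as st
from sqlalchemy import create_engine

# Placeholder connection string -- not the actual deployment.
engine = create_engine('postgresql+psycopg2://user:password@host/soundscape')
df = pd.read_sql('SELECT label, score, device, sync_time FROM observations', engine)

st.title('Urban soundscape monitor')
device = st.selectbox('Device', sorted(df['device'].unique()))
selected = df[df['device'] == device]

# One possible view: detected sound classes and their confidence over time.
fig = px.scatter(selected, x='sync_time', y='score', color='label', title='Detected sound classes')
st.plotly_chart(fig)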

And off we go. The plan is to expand the sensor network to around 20 devices scattered across multiple locations in my city. More information on this larger sensor deployment will be available soon.
Additionally, I plan to keep collecting data from the deployed sensors and share data packages, dashboards, and analyses in future blog posts. I also intend to dig deeper into the audio classification results: the main idea is to match the measured sound pressure levels with the detected audio classes to describe noise pollution in a more meaningful way. So stay tuned for a detailed breakdown soon.
In the meantime, you can read a preliminary paper on my soundscape research (headphones are required).
This post was proofread and edited using Grammarly to improve grammar and clarity.