A deep learning pipeline for age prediction from vocalisations of the domestic feline



Research design and rationale

This study employed a deep learning approach to age estimation in domestic cats based on their vocalisations. The methodology combined original data collection with transfer learning models for feature extraction and a downstream classification model. Given the lack of a publicly available dataset suitable for this task, we conducted our own data collection and explored existing datasets for potential inclusion. Following data collection, we applied transfer learning models (VGGish, YAMNet, Perch) for feature extraction and processed the resulting embeddings through a multilayer perceptron (MLP) for classification. The following sections describe in detail the data collection approaches (see Data Collection, Data Context Restrictions, and Data Sharing), feature extraction through transfer learning (Transfer Learning Models for Feature Extraction, and Audio Input Requirements for VGGish, YAMNet, and Perch), the preparation of feature embeddings for downstream learning (Preparing Feature Embeddings for Downstream Learning), hyperparameter tuning and bias mitigation (Hyperparameter Tuning and Bias Mitigation), data enhancement techniques (Data Enhancement), and an ethics declaration (Ethics Declaration). Figure 1 provides an overview of the full methodology pipeline.

Data collection

Due to the lack of a suitable available dataset, original data collection was conducted. We initially considered partly using the CatMeow dataset (Ludovico et al., 2020). Unfortunately, this dataset was recorded using low-budget Bluetooth devices, which, while making the data collection accessible and scalable, compromised the audio quality. Problems such as white noise, interruptions in the waveforms, and generally poor sound quality led to the decision to exclude this dataset from our study. We also reached out to the owners of two other datasets: one45 had not recorded ages, and the other11 could not yet publicise their data due to ongoing research. The dataset presented in this work is the first publicly available dataset of its kind.

For data collection, we reached out to online communities through platforms such as Reddit, Lemmy, TikTok, LinkedIn, and Instagram, and to local communities through personal networks, cafés, and veterinarians, supplemented by information leaflets and promotional videos. A significant number of contributions were sourced through Lemmy and freesound.org, each harbouring a supportive and engaged community.

From freesound.org, only lossless recordings in WAV, FLAC, or M4A were selected, ensuring a high standard of audio quality. The required information for these samples, such as the cats’ ages, was acquired by contacting the owners directly. The remaining participants were instructed to use the ’Dolby On’ recording app (no affiliation), set to lossless audio in WAV format at 48kHz, to ensure uniformity and optimal audio quality when recording from mobile devices. Individual submissions in other formats or of poor quality were discarded. The majority of samples were recorded at 48kHz (63%), followed by 44.1kHz (24%), with some at 96kHz (7%) and 16kHz (6%); all were 16-bit audio and were converted to Waveform Audio File Format (WAV) to fit the expected input format of the transfer learning models. While the unique recording settings of each contribution may enhance real-world generalisation, in the context of a smaller dataset this can become a disadvantage: the model has fewer environmentally consistent examples to learn from, which may hinder performance. However, we prioritised generalisation over potentially over-optimistic results.
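As an illustration of this conversion step (the exact tooling is not prescribed here), the sketch below uses pydub, which is also employed later in the pipeline for audio looping, to export a lossless submission as 16-bit WAV. The file names are hypothetical, and non-WAV inputs require ffmpeg to be installed.

```python
# Illustrative conversion step (not the authors' exact script): load a lossless
# submission and export it as 16-bit PCM WAV at its native sample rate.
from pydub import AudioSegment

def to_wav_16bit(in_path: str, out_path: str) -> None:
    audio = AudioSegment.from_file(in_path)   # FLAC, M4A, WAV, ... (ffmpeg needed)
    audio = audio.set_sample_width(2)         # 2 bytes per sample = 16-bit audio
    audio.export(out_path, format="wav")

to_wav_16bit("cat_012A_meow.flac", "cat_012A_meow.wav")  # hypothetical file names
```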

Table 2 Dataset summary and demographic distribution.

Data context restrictions

Inclusion was restricted to cases where owners were certain of their pet’s age. Breed information was generally not collected due to many owners’ uncertainty about their pets’ specific breeds. For some samples, recording contexts were noted, including vet visits, food, attention, and doors. Some owners provided valuable time-series data from the same cats over different years – ideal for our task. For the purpose of direct comparison and due to data limitations, the vocalisation types included in this research are meows (a sound that often includes multiple vowels), mews (higher-pitched meows), and squeaks (short, raspy, nasal, high-pitched mew-like calls) as defined by46 and later expanded by10. Vocalisations such as purrs and yowls were therefore discarded. Noisy samples or those with overlapping sounds were also discarded. Litters of kittens could not be uniquely identified, so they were grouped together.

A total of 793 meows were included, each manually extracted using Audacity to ensure the highest quality. In some cases, a noisy section was cropped out of an otherwise good-quality meow, while ensuring each cat retained at least one full meow. This modification means that vocal duration cannot be taken into account in this study. A statistical summary of the dataset is presented in Table 2.

In our dataset the youngest cat is five weeks old and the oldest is eighteen years old. Considering the relatively small dataset, we decided to focus on three age categories for classification: kittens, adults, and seniors. Since there is no clear consensus on when an adult cat transitions into a senior, the age ranges were defined as follows: kittens (0–0.5 years), adults (0.5–10 years), and seniors (10+ years). These ranges are based on typical developmental stages in cats, where kittens represent early life, adults cover the majority of a cat’s mature lifespan, and seniors reflect the later years. This categorisation results in 135 kitten, 405 adult, and 253 senior samples, yielding an imbalanced dataset. Finally, each cat is assigned a unique identifier that allows us to group cats together during training and testing, avoiding data leakage.
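As a minimal illustration of this binning, the snippet below maps an age in years to the three study classes; how the exact boundary values are assigned is our assumption, as only the ranges are specified above.

```python
# Illustrative mapping of a cat's age (in years) to the three study classes.
def age_class(age_years: float) -> str:
    if age_years < 0.5:
        return "kitten"   # 0 - 0.5 years
    elif age_years < 10:  # boundary handling at exactly 10 years is an assumption
        return "adult"    # 0.5 - 10 years
    return "senior"       # 10+ years

print(age_class(5 / 52))  # a five-week-old cat -> "kitten"
```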

Data sharing

There is a lack of open source data in bioacoustics47; just 21% of publications in the field publish their recordings for further research48. The dataset compiled for this study will therefore be made available alongside the publication, ensuring that future researchers can access high-quality audio samples for further analysis.

Transfer learning models for feature extraction

As discussed, VGGish, YAMNet, and Perch have proven effective across a wide range of sound detection applications25,31,32,33,34,35,36,37. Their open accessibility and comprehensive documentation facilitate straightforward use, enabling replicability and potential for further development. All models automatically convert raw input audio in WAV format to spectrograms and extract features into high-dimensional vector embeddings: numerical representations of the data. A summary of these models, along with their precise input requirements and restrictions, follows.

VGGish is modified from the VGG Convolutional Neural Network (CNN) – characterised by its simplicity, depth, the use of small convolutional filters, and known for top performance in image classification tasks49. YAMNet is another CNN-based deep neural network, utilising the MobileNet V1 architecture – known for its efficiency and effectiveness, particularly in environments with limited computational resources50. Perch contains an EfficientNet B1 backbone – a CNN designed for high efficiency and scalability across different computing environments51. While VGGish was trained on YouTube-8M (closely related to AudioSet) and YAMNet directly on the AudioSet dataset52, both of which focus on general sound events, Perch was specifically trained on birdsong embeddings derived from XenoCanto wildlife data. An overview of the three architectures is presented in Table 3.

All models can be run either locally or through hosted versions on TensorFlow Hub (TFHub). VGGish performed better locally using TensorFlow’s Model Garden53, possibly due to system configurations or outdated versions on TFHub. The core dependencies for this local setup can be found in Table 4. For YAMNet, TFHub inference54 was found to be more efficient than the Model Garden build with minimal performance difference, so this option was chosen. Perch, lacking public pre-trained network weights and requiring substantial computational resources, was not built locally and therefore also used TFHub inference55.
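As a minimal sketch of TFHub inference (not our exact pipeline), the snippet below loads the publicly hosted YAMNet model and extracts frame-level embeddings from a 16kHz mono waveform; Perch is used analogously via its own TFHub handle, and the input file name is hypothetical.

```python
# Minimal sketch of TFHub inference with YAMNet; not the authors' exact pipeline.
import soundfile as sf
import tensorflow_hub as hub

yamnet = hub.load("https://tfhub.dev/google/yamnet/1")

waveform, sr = sf.read("meow_16k_mono.wav", dtype="float32")  # hypothetical file,
assert sr == 16000                                            # already 16 kHz mono
scores, embeddings, log_mel = yamnet(waveform)                # embeddings: (frames, 1024)
print(embeddings.shape)
```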

Table 3 Comparison of neural network architectures for audio feature extraction.
Table 4 System configuration for VGGish environment.

Audio input requirements for VGGish, YAMNet, and Perch

This section describes the input requirements for each architecture based on model documentation and codebases. We found that looping the audio data to fill the required embedding window size (Table 3) was preferable to padding it with silence. All models normalise the audio to a range of -1.0 to +1.0 and apply a Short-Time Fourier Transform (STFT) to generate the spectrogram. The STFT transforms the audio signal from the time domain to the frequency domain by segmenting it into small overlapping time windows, allowing frequency variations to be analysed over time and represented visually as a spectrogram.
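The looping step can be sketched as follows; this is an illustrative NumPy version rather than the exact implementation (Pydub was used for VGGish and YAMNet), and the clip length shown is a placeholder.

```python
import numpy as np

def loop_to_length(samples: np.ndarray, target_len: int) -> np.ndarray:
    """Repeat a short clip until it covers target_len samples, then trim."""
    reps = int(np.ceil(target_len / len(samples)))
    return np.tile(samples, reps)[:target_len]

# e.g. a 0.4 s clip at 16 kHz looped to fill one 0.96 s VGGish window
clip = np.random.randn(6400).astype(np.float32)   # placeholder waveform
window = loop_to_length(clip, int(0.96 * 16000))
```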

VGGish

VGGish accepts raw WAV files for feature inference and generates feature embeddings in 0.96-second intervals. For this study, we modified the sample window hop size from 0.96 seconds to 0.48 seconds to ensure a 50% overlap in audio processing, aligning with YAMNet’s pipeline. The audio is looped using Pydub to fit these increments. VGGish resamples the audio to 16kHz mono, normalises it, and computes a spectrogram using an STFT with a 25ms window size and a 10ms hop size, resulting in overlapping 25ms frames. The spectrogram is then mapped to 64 mel bins, which represent frequency bands on a scale that more closely matches how humans perceive sound. This scale emphasises lower frequencies and compresses higher ones, making it well suited to analysing vocalisations. In this case, the bins cover a frequency range of 125Hz–7500Hz, which is sufficient for cats but may need adjustment for other species. Finally, a log function is applied to produce a log mel-frequency spectrogram, and the resulting features are passed to VGGish’s neural network to generate the numerical embeddings.
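VGGish performs this front end internally; the sketch below merely approximates the described parameters with librosa for illustration. The file name and the log offset are assumptions, not values taken from VGGish’s own code.

```python
import librosa
import numpy as np

y, sr = librosa.load("meow.wav", sr=16000, mono=True)   # resample to 16 kHz mono
mel = librosa.feature.melspectrogram(
    y=y, sr=sr,
    n_fft=400, hop_length=160,      # 25 ms window, 10 ms hop at 16 kHz
    n_mels=64, fmin=125, fmax=7500  # 64 mel bins between 125 Hz and 7.5 kHz
)
log_mel = np.log(mel + 1e-6)        # log offset chosen for illustration only
```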

YAMNet

Developed by the same authors as VGGish, YAMNet follows an identical input process, with the same STFT configuration and audio sampling.

Perch

Perch processes audio in significantly longer 5-second segments and uses a 32kHz mono sampling rate, double that of VGGish and YAMNet. A small proportion of samples (6%) had to be upsampled to match this rate, though this is expected to have minimal impact since they represent just two cats. While it is known that Perch uses an STFT to create spectrograms, further configuration details are not disclosed in the TFHub documentation, so we cannot make any further statements on its internal processing.
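A hedged sketch of this resampling and segmentation step, using librosa and the looping strategy described earlier; the file name is hypothetical and the resampling method is illustrative rather than Perch’s internal processing.

```python
import librosa
import numpy as np

y, sr = librosa.load("meow.wav", sr=None, mono=True)      # keep the native rate
y32 = librosa.resample(y, orig_sr=sr, target_sr=32000)    # Perch expects 32 kHz mono

seg_len = 5 * 32000                                       # 5-second window
reps = int(np.ceil(seg_len / len(y32)))
segment = np.tile(y32, reps)[:seg_len]                    # loop short clips to length
```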

Preparing feature embeddings for downstream learning

One of the benefits of using pre-trained models as feature extractors is that it simplifies data processing; the transfer learning models internally normalise and process the audio clips. After feature extraction, we passed the embeddings to an MLP neural network, a type of artificial neural network loosely inspired by the inner workings of the human brain and one of the most commonly used neural networks56.

For this downstream task, data standardisation and scaling are important to ensure that all features contribute equally to model training. We applied Scikit-learn’s StandardScaler so that each feature has a mean of zero and a standard deviation of one, which improves training stability and performance for models such as MLPs. Other scalers, such as RobustScaler and MinMaxScaler, were explored, but StandardScaler showed the best initial results; further tuning may be needed.
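A minimal sketch of this scaling step, with random placeholder matrices standing in for the extracted embeddings:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.random.rand(600, 1024)   # placeholder embedding matrices
X_test = np.random.rand(193, 1024)

scaler = StandardScaler()                         # zero mean, unit variance per feature
X_train_scaled = scaler.fit_transform(X_train)    # fit on the training split only
X_test_scaled = scaler.transform(X_test)          # reuse training statistics at test time
```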

Hyperparameter tuning and bias mitigation

The downstream models were developed with TensorFlow’s Keras, a well-known end-to-end open-source machine learning framework. To determine the best configuration, we used the Optuna hyperparameter tuner for its ease of implementation. For each model (VGGish, Perch, and YAMNet), 300 trials were run to find the optimal parameters. The depth of our exploration is detailed in Table 5.

Table 5 Hyperparameters and their search space for tuning with Optuna.
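The sketch below illustrates the shape of such an Optuna study on synthetic data; the parameter names, ranges, and network depth are placeholders, with the authoritative search space given in Table 5 and the tuned values in Tables 7, 8, and 9.

```python
import numpy as np
import optuna
import tensorflow as tf
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the extracted embeddings and age-class labels.
X = np.random.rand(200, 128).astype("float32")
y = np.random.randint(0, 3, size=200)

def objective(trial):
    units = trial.suggest_int("units", 32, 256)
    dropout = trial.suggest_float("dropout", 0.0, 0.6)
    lr = trial.suggest_float("lr", 1e-4, 1e-2, log=True)

    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, stratify=y)
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(units, activation="relu"),
        tf.keras.layers.Dropout(dropout),
        tf.keras.layers.Dense(3, activation="softmax"),  # kitten / adult / senior
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(lr),
                  loss="sparse_categorical_crossentropy")
    model.fit(X_tr, y_tr, epochs=10, verbose=0)
    preds = model.predict(X_val, verbose=0).argmax(axis=1)
    return f1_score(y_val, preds, average="macro")       # maximise macro F1

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)  # the study itself used 300 trials per model
```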

Stratified sampling and grouping

Stratified sampling ensures that classes are represented in the same proportion in both the train and test sets57, which is crucial for handling our imbalanced dataset. To prevent data leakage between the train and test sets, we grouped samples by cat_id using Scikit-learn’s StratifiedGroupKFold, which allows stratification while ensuring no overlap between splits. This ensured that all samples from each cat_id appeared in the test set exactly once across the k-fold validation rounds. With limited data, four folds was the maximum possible without causing severe class imbalances.
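A minimal sketch of this grouping strategy, using placeholder labels and cat identifiers:

```python
import numpy as np
from sklearn.model_selection import StratifiedGroupKFold

X = np.random.rand(793, 128)                   # placeholder embeddings
cat_ids = np.random.randint(0, 80, size=793)   # one identifier per cat
y = cat_ids % 3                                # placeholder class per cat (0/1/2)

sgkf = StratifiedGroupKFold(n_splits=4, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(sgkf.split(X, y, groups=cat_ids)):
    # no cat_id appears in both the training and test indices of a fold
    assert set(cat_ids[train_idx]).isdisjoint(cat_ids[test_idx])
```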

One adaptation to the data distribution after splitting is that adult cat 000A and kitten 046A were always included in the training set, swapped for a random cat_id of the same class when necessary. This decision was made because these cats contributed a high number of samples across varying contexts, containing information crucial for learning performance. Lastly, to avoid ordering patterns that could be learned by the model, we shuffled each class’s samples once before splitting and once again immediately before model training.

Avoiding overfitting and bias

Dropout layers58 and early stopping58 (both Keras functions) were used to minimise overfitting. To further minimise bias and obtain a robust statistical performance estimate, we applied nested cross-validation59. This technique, visualised in Fig. 2, allowed us to tune hyperparameters on a validation set in the inner loop while using an unseen test set in the outer loop for final evaluation, preventing tuning bias and allowing a fair comparison between architectures.
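A minimal sketch of such an MLP head with dropout and early stopping; the layer sizes, dropout rate, and patience are placeholders rather than the tuned values reported in Tables 7, 8, and 9.

```python
import tensorflow as tf

# Illustrative MLP head; sizes and rates are placeholders, not the tuned values.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.3),                  # randomly drops units each step
    tf.keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=10, restore_best_weights=True
)
# model.fit(X_train, y_train, validation_data=(X_val, y_val), callbacks=[early_stop], ...)
```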

Each Optuna trial comprised 4 outer and 4 inner loops (16 runs in total), with Optuna proposing the next set of parameters based on the average F1-score across the inner folds. Each set of hyperparameters is therefore optimised without direct knowledge of the outer test data against which it is eventually evaluated.

Table 6 System configuration for downstream learning after obtaining the feature embeddings from the respective transfer learning models: VGGish, Perch, and YAMNet.
Fig. 2

Schematic of the nested cross-validation method used in our study. The dataset is divided into four outer folds, each containing a distinct test set and a corresponding training set. Each training set is further divided into four inner folds used for validation of the hyperparameters during Optuna’s tuning process.

Challenges and adaptations

A limitation of this method is that subdividing an already small dataset in the inner loop further reduces the data available for training, leading to poorer performance on the inner validation set. As a result, hyperparameter tuning in the inner loop does not always generalise well to the outer (test) set. While “better” hyperparameters could be obtained by tuning on the outer set, this would introduce bias, as the model would adapt to the test data, violating the assumption of unseen data59. Additionally, due to computational constraints we slightly adapted the cross-validation method as described: we averaged the F1-score across inner folds rather than tuning for each fold individually. While this may introduce a small amount of bias, we believe that the combination of nested cross-validation, multi-seed validation (discussed below), and careful data splitting and grouping by cat_id represents best practice for unbiased model evaluation and optimisation. After determining the best parameters, we manually fine-tuned them to further optimise performance.

Performance metrics and testing for robustness

As we are dealing with imbalanced data, macro-averaging, as defined in Scikit-learn’s model evaluation documentation, was applied to give equal weight to each class in the performance calculations60, with a focus on the F1-score for comparison.

$$\text{Macro-}M = \frac{1}{|L|} \sum_{l \in L} M(y_l, \hat{y}_l)$$

Here, \(\text{Macro-}M\) represents the macro-averaged metric, |L| denotes the number of classes, and the summation \(\sum_{l \in L}\) runs over each class. \(M(y_l, \hat{y}_l)\) is the metric (here, the F1-score) for a single class l, calculated from its true labels (\(y_l\)) and predicted labels (\(\hat{y}_l\)).

The F1-score is the harmonic mean of precision (the ratio of correctly predicted positives to all predicted positives) and recall (the ratio of correctly predicted positives to all actual positives), providing a single score that balances both. Finally, to assess the robustness and reliability of the results for real-world application, we applied Levene’s test for homogeneity of variances over the aggregated metrics, as outlined by61.
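For illustration, macro-averaged F1 and Levene’s test are available directly in Scikit-learn and SciPy; the labels and fold-level scores below are placeholders.

```python
import numpy as np
from scipy.stats import levene
from sklearn.metrics import f1_score

y_true = np.array([0, 0, 1, 1, 1, 2, 2])          # placeholder labels
y_pred = np.array([0, 1, 1, 1, 2, 2, 2])
print(f1_score(y_true, y_pred, average="macro"))   # per-class F1, averaged with equal weight

# Levene's test for homogeneity of variances across two sets of aggregated scores
scores_a = [0.71, 0.68, 0.73, 0.70]                # hypothetical fold-level metrics
scores_b = [0.64, 0.75, 0.59, 0.80]
stat, p = levene(scores_a, scores_b)
```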

Our final performance estimate applies standard 4-fold cross-validation with a train/test split only – the validation set is omitted to preserve more data for training, given the prior intensive validation during optimisation in the inner loop. The results are run over five different seeds to enhance reliability while ensuring reproducibility; the five seeds are randomly generated from another (reproducible) random seed, namely 42. Grouping by cat_id further contributes to a truer performance estimate, as the test set always contains samples recorded in settings different from the training samples; these novel conditions act as unseen real-world samples. The final performance estimates are averaged over seeds and folds to provide a robust estimate of performance on real-world data. The system configuration used for the downstream tasks in this project is detailed in Table 6. The hyperparameters used to train the MLPs on the embeddings extracted by each transfer learning architecture are detailed in Tables 7, 8, and 9.
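One way to derive the five run seeds reproducibly from a master seed of 42 is sketched below; the exact generation scheme used in the study is not specified here, so this is an assumption.

```python
import numpy as np

# Derive five run seeds from a reproducible master seed (illustrative scheme only).
rng = np.random.default_rng(42)
run_seeds = rng.integers(0, 2**31 - 1, size=5)
print(run_seeds)
```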

Table 7 VGGish embeddings hyperparameters for downstream MLP task.
Table 8 YAMNet embeddings hyperparameters for downstream MLP task.
Table 9 Perch embeddings hyperparameters for downstream MLP task.

Data enhancement

In this study, various data augmentation techniques, including pitch shift, time stretch, and gain manipulation (implemented using audiomentations), were applied across multiple configurations to balance class distributions and improve model performance. Detailed audiomentations results are omitted for brevity but are available upon request. The popular data augmentation techniques SMOTE and MixUp were also tested but failed to yield significant performance gains; these results are available in the supplementary code repository.
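An illustrative audiomentations chain of the three transformations is sketched below; the ranges and probabilities are placeholders rather than the configurations evaluated in this study, and the gain transform is left at its defaults since its argument names vary across library versions.

```python
import numpy as np
from audiomentations import Compose, Gain, PitchShift, TimeStretch

# Placeholder augmentation chain; not the configurations evaluated in the study.
augment = Compose([
    PitchShift(min_semitones=-2, max_semitones=2, p=0.5),
    TimeStretch(min_rate=0.9, max_rate=1.1, p=0.5),
    Gain(p=0.5),                                       # default gain range
])

samples = np.random.randn(16000).astype(np.float32)   # placeholder 1 s waveform at 16 kHz
augmented = augment(samples=samples, sample_rate=16000)
```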

Despite the breadth of augmentation approaches, none demonstrated substantial improvement in classification performance. This suggests that these methods, while useful in specific contexts, may not be sufficient for enhancing performance in this case. Although our experiments did not show significant benefits from traditional augmentation methods, these observations highlight the importance of model experimentation and optimisation in digital bioacoustics. Future work could explore more advanced augmentation techniques such as SpecAugment, which operates directly on spectrograms.

To address the class imbalance, class weight balancing was applied using the compute_class_weight utility from Scikit-learn. This technique adjusts the loss function by assigning higher penalties to minority classes, encouraging the model to better handle imbalanced datasets. Early experiments showed that this technique did enhance performance, so it was applied to all categorical models during both the tuning and testing phases. The specific class weights computed for each architecture, seed, and fold are provided in Appendix A.
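A minimal sketch of this weighting step; the label vector below simply mirrors the overall class counts rather than an actual training split.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y_train = np.array([0] * 135 + [1] * 405 + [2] * 253)   # kitten / adult / senior counts
weights = compute_class_weight(class_weight="balanced",
                               classes=np.unique(y_train), y=y_train)
class_weight = dict(enumerate(weights))   # e.g. passed to Keras model.fit(class_weight=...)
print(class_weight)
```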

Ethics declaration

The project adheres to the ethical and professional considerations outlined in the British Computer Society’s (BCS) Code of Conduct and conforms to the ARRIVE guidelines for animal research.

Prior to the data collection phase, ethical approval was sought from the University of Essex, where the preliminary phase of this research was performed. The proposed methods for collecting feline vocalisations were approved with few concerns. The same was reported by11 when seeking approval from the Swedish Ethical Review Authority, which likewise confirmed that observations of privately owned cats in their home environment do not require further ethical approval.

Nevertheless, ethical compliance was safeguarded by providing all participants with information outlining their right to withdraw, data anonymisation, consent for research use, and their data subject rights (Appendix B). This ensures that the privacy of individuals is protected, in compliance with the General Data Protection Regulation (GDPR). This is equally relevant to any human voices accidentally captured during passive acoustic monitoring (PAM) in future work, which are to be excluded from processing47,62.

Table 10 Categorical results (kitten, adult, senior) on downstream learning task from feature embeddings extracted with VGGish, Perch, and YAMNet. Abbreviations: Acc. (W) = Weighted Accuracy; Acc. = Accuracy; D. (s) = Duration in seconds. Accuracy, Precision, Recall, and F1-Score values have been calculated with macro-averaging to account for data imbalance.
Table 11 Binary results (kitten, senior) on downstream learning task from feature embeddings extracted with VGGish, Perch, and YAMNet. Abbreviations: Acc. (W) = Weighted Accuracy; Acc. = Accuracy; D. (s) = Duration in seconds. Accuracy, Precision, Recall, and F1-Score values have been calculated with macro-averaging to account for data imbalance.




