Bigger = better?
In AI, bigger is often better, provided you have enough data to feed large models. With limited data, however, the larger the model, the more likely it is to overfit. Overfitting occurs when a model memorizes patterns in the training data that do not generalize to real-world examples. But there is another way to frame this that I find even more compelling in this context.
Suppose you have a small dataset of spectrograms and need to decide between a small CNN (100,000 parameters) and a large CNN (10 million parameters). All model parameters are effectively best-guess numbers derived from the training dataset. Seen this way, it is clearly easier for a model to get 100,000 parameters right than 10 million.
In the end, both arguments lead to the same conclusion.
If you lack data, consider building smaller models that focus only on important patterns.
But how can we actually achieve smaller models?
Do not crack walnuts with a hammer
My learning journey in Music AI has been dominated by deep learning. Until a year ago, I was using large neural networks to solve almost every problem. While this makes sense for complex tasks like music tagging and instrument recognition, not all tasks are that complex.
For example, by analyzing the time between onsets or correlating chromagrams with key profiles, you can build good BPM estimators and key detectors without machine learning.
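As an illustration, here is a minimal sketch of a machine-learning-free BPM estimator based on inter-onset intervals. It assumes librosa for onset detection; the file name and the median/octave-folding heuristic are placeholders for illustration, not a production-ready tempo tracker.

```python
import numpy as np
import librosa

def estimate_bpm(path: str) -> float:
    """Rough BPM estimate from the median inter-onset interval (no ML)."""
    y, sr = librosa.load(path)                             # mono waveform
    onsets = librosa.onset.onset_detect(y=y, sr=sr, units="time")
    if len(onsets) < 2:
        raise ValueError("Not enough onsets detected to estimate tempo.")
    ioi = np.diff(onsets)                                  # seconds between onsets
    bpm = 60.0 / np.median(ioi)                            # beats per minute
    # Crudely fold into a plausible tempo range to handle octave errors.
    while bpm < 70:
        bpm *= 2
    while bpm > 180:
        bpm /= 2
    return bpm

print(estimate_bpm("some_track.wav"))                      # hypothetical file
```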
Even a task such as music tagging doesn't necessarily require a deep learning model: I've achieved good results with mood tagging using simple k-nearest-neighbor classifiers on an embedding space (e.g. CLAP).
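A sketch of that idea, assuming you have already extracted one embedding vector per track (e.g. with a pretrained CLAP model) and saved the embeddings and mood labels to disk; the file names and label set are made up for illustration:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical precomputed audio embeddings (n_tracks x embedding_dim),
# e.g. extracted once with a pretrained CLAP model.
embeddings = np.load("clap_embeddings.npy")      # shape: (n_tracks, 512)
moods = np.load("mood_labels.npy")               # shape: (n_tracks,)

# Cosine distance tends to work well on audio embeddings.
knn = KNeighborsClassifier(n_neighbors=5, metric="cosine")
scores = cross_val_score(knn, embeddings, moods, cv=5)
print(f"Mean CV accuracy: {scores.mean():.2f}")
```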
Most state-of-the-art methods in music AI are based on deep learning, but in situations of data scarcity, alternative solutions are worth considering.
Be careful with data input size
The choice of input data is typically even more important than the choice of model. In Music AI, we rarely use raw waveforms as inputs because of their low data efficiency. By converting waveforms to (mel) spectrograms, we can reduce the dimensionality of the input data by a factor of 100 or more. This matters because larger or more complex models are typically required to handle large inputs.
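As a quick sanity check, here is a minimal sketch comparing the number of values in a raw waveform with those in its mel spectrogram; the exact reduction factor depends on the sample rate, hop length, and number of mel bands you pick, and the file name and parameters below are just placeholders.

```python
import librosa

# Hypothetical 30-second clip; librosa resamples to 22,050 Hz mono by default.
y, sr = librosa.load("some_track.wav", duration=30.0)

# A fairly coarse mel spectrogram; the compression you get depends on the
# sample rate, hop length, and number of mel bands (and on whether you
# compare against the original full-rate stereo audio).
mel = librosa.feature.melspectrogram(y=y, sr=sr, hop_length=1024, n_mels=64)

print("Waveform values:       ", y.size)     # ~661,500 for 30 s of mono audio
print("Mel spectrogram values:", mel.size)   # 64 mel bands x n_frames
```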
To minimize the size of your model inputs, you can take two routes:
- Use shorter music snippets.
- Use a more compressed/simplified musical representation.
Use short music snippets
Using shorter music snippets is especially effective when the target you're interested in is global, meaning it applies to all sections of the song. For example, you can assume that a track's genre is relatively stable across the whole track. So instead of the entire track (or the very common 30-second snippet), you could easily use a 10-second snippet for a genre classification task.
This has two benefits:
- Shorter snippets mean fewer data points per training sample, allowing you to use a smaller model.
- By drawing three 10-second snippets instead of one 30-second snippet, we can triple the number of training observations (see the sketch after this list). Overall, this means we can build a less data-hungry model while providing it with many more training examples than before.
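Here is a minimal sketch of that snippet-sampling idea, assuming the audio is already loaded as a mono NumPy array; the snippet length and the number of snippets per track are hyperparameters you would tune:

```python
import numpy as np

def sample_snippets(y: np.ndarray, sr: int, snippet_s: float = 10.0,
                    n_snippets: int = 3, seed: int = 0) -> list[np.ndarray]:
    """Draw several random, equally long snippets from one waveform."""
    rng = np.random.default_rng(seed)
    snippet_len = int(snippet_s * sr)
    if len(y) <= snippet_len:
        return [y]                      # track is shorter than one snippet
    starts = rng.integers(0, len(y) - snippet_len, size=n_snippets)
    return [y[s:s + snippet_len] for s in starts]

# Every snippet inherits the track-level label, multiplying the training set:
# snippets = sample_snippets(y, sr)  ->  3 labeled examples instead of 1
```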
But there are two potential dangers here. First, the snippets must be long enough to make classification possible. For example, even humans have a hard time classifying genres when presented with 3-second snippets. Choose the snippet length carefully and treat this decision as a hyperparameter of your AI solution.
Second, not all musical attributes are global. For example, the fact that a song contains vocals doesn't mean it has no instrumental sections. Splitting the track into very short snippets can therefore introduce many incorrectly labeled samples into the training dataset.
Use more efficient musical representations
If you studied music AI 10 years ago (back then it was called “music information retrieval”), you learned about chromagrams, MFCCs, and beat histograms. These handcrafted features were designed to make music data work with traditional ML approaches. With the rise of deep learning, they were completely replaced by the (mel) spectrogram.
The spectrogram compresses music into an image without losing too much information, which makes it ideal for use with computer vision models. Instead of designing custom features for different tasks, you can now use the same input representation and model for most Music AI problems. But this only works if you have tens of thousands of training samples to feed these models.
When data is scarce, we want to compress the information as much as possible to help the model extract relevant patterns. Considering the four musical representations below, which one would help you identify the key of the music the fastest?
While mel-spectrograms can be used as input for non-trivial detection systems (and they should, if you have enough data), simple chromagrams averaged along the time dimension reveal this particular information more quickly. Therefore, while spectrograms require complex models such as CNNs, chromagrams can be easily analyzed with traditional models such as logistic regression or decision trees.
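To make that concrete, here is a minimal sketch of a key classifier built on time-averaged chromagrams and logistic regression, assuming librosa and scikit-learn; the file paths and key labels are placeholders for a small annotated dataset.

```python
import numpy as np
import librosa
from sklearn.linear_model import LogisticRegression

def averaged_chroma(path: str) -> np.ndarray:
    """12-dimensional feature: chromagram averaged over time."""
    y, sr = librosa.load(path)
    chroma = librosa.feature.chroma_cqt(y=y, sr=sr)    # shape: (12, n_frames)
    return chroma.mean(axis=1)                         # shape: (12,)

# Hypothetical small labeled dataset: file paths and their annotated keys.
paths = ["track_01.wav", "track_02.wav"]               # ...
keys = ["C major", "A minor"]                          # ...

X = np.stack([averaged_chroma(p) for p in paths])      # (n_tracks, 12)
clf = LogisticRegression(max_iter=1000).fit(X, keys)   # tiny model, tiny input

print(clf.predict(averaged_chroma("new_track.wav").reshape(1, -1)))
```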
In summary: given enough data, the well-established spectrogram + CNN combination can still be very effective for many problems. However, if the dataset is small, it may make sense to revisit some of the feature-engineering techniques from MIR or develop your own task-specific representation.