This project focuses on classifying bird vocalizations and extends previous work from Engineers for Exploration (E4E). We are developing techniques to overcome challenges in audio data classification due to domain shift from labeled training data to unlabeled, noisy field recordings.
Our objective is to leverage modern machine learning architectures and domain-specific adaptations to accurately identify bird species from acoustic data. This work is part of our broader efforts to aid ecological research and conservation through advanced bioacoustic classification systems.
Currently, we are working on implementing and evaluating various machine learning models, including CNNs and newer state-space architectures, to improve the accuracy and reliability of our identification systems. Our work contributes to the BirdCLEF 2024 competition on Kaggle, aiming to push the boundaries of current bioacoustic research.
The standard pipeline for audio classification is convolutional: an audio clip is first converted into a spectrogram, which is then passed through a standard convolutional architecture such as EfficientNet. Recently, however, a new class of state-space models has been shown to be effective in audio classification tasks. We explored the potential application of such models to bird call classification, integrating them into our existing pipeline to compare their performance in a controlled environment.
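For concreteness, the sketch below shows this standard spectrogram-plus-CNN pipeline. It assumes torchaudio and timm are available; the sample rate, mel parameters, species count, and model variant are illustrative rather than our exact training configuration.

```python
import torch
import torchaudio
import timm

NUM_SPECIES = 182  # illustrative; set to the size of the species label set

# Audio clip -> log-mel spectrogram -> CNN backbone -> per-species logits.
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=32000, n_fft=2048, hop_length=512, n_mels=128
)
to_db = torchaudio.transforms.AmplitudeToDB()

# Single-channel spectrogram input; spectrograms are sometimes tiled to 3 channels instead.
backbone = timm.create_model("tf_efficientnet_b4", pretrained=False,
                             in_chans=1, num_classes=NUM_SPECIES)

waveform = torch.randn(1, 5 * 32000)       # a 5-second clip at 32 kHz
spec = to_db(mel(waveform)).unsqueeze(1)   # (batch, 1, n_mels, time)
logits = backbone(spec)                    # (batch, NUM_SPECIES)
```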
We chose the Mamba architecture as the backbone for our method, owing to its relative success in modeling audio data. One initial challenge was that the version of Mamba adapted to audio -- a mixture of Mamba and S4 in a Sashimi architecture -- had not been publicly implemented. Splicing together existing code, we obtained a model that began to train successfully but was unusably slow, likely due to backpropagation through time for the time-varying Mamba component. The Mamba model could train at roughly 1.5 clips per second, compared to roughly 180 for EfficientNet. After training for 5 days on a 24-GB GPU, the model had completed only two epochs and achieved a validation mAP of roughly 0.1 -- an improvement over its starting value of 0.05, but nowhere near the 0.5 to 0.7 range of an EfficientNet trained for many more epochs. At this point, we determined that without significant optimization, the model would not be able to compete with our existing pipeline.
We found more success in adapting SSAMBA, a Mamba-based model operating on spectrograms which is pretrained in such a way as to adapt quickly to classification tasks. The SSAMBA work found success in classification on the ESC dataset, which includes some categories of bird calls, and so it seemed to be a natural fit for our task. Such a pretrained model would hopefully be more data-efficient than our existing pipeline, which is trained from scratch and thus begins with no acoustic pattern recognition.
We integrated SSAMBA into our pipeline, using the same data augmentation and training setup as the EfficientNet models and the optimization parameters from the SSAMBA paper. The model is still significantly less efficient to train than EfficientNet, at a rate of 7 to 8 clips per second. Due to time constraints, all comparisons are made at 2 epochs. This is not entirely fair, as EfficientNet is normally trained for 50 epochs, but it is a reasonable compromise given how slowly SSAMBA trains. We find that SSAMBA-small and SSAMBA-base both achieve a higher training mAP than EfficientNet, exceeding 0.30 where EfficientNet reaches only 0.27 (SSAMBA-tiny is still training). However, the validation mAP is lower for both SSAMBA models, indicating that they are overfitting. Given that SSAMBA-small has only 26M parameters to EfficientNet's 39M, and that SSAMBA-tiny is on a trajectory to surpass EfficientNet in training mAP as well, it is interesting to see that SSAMBA is sufficiently expressive to overfit to such a complicated distribution. This is encouraging: small SSAMBA models with proper regularization and more aggressive data augmentation could potentially be competitive in this domain.
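The SSAMBA loading and pretraining details are not reproduced here; the sketch below only illustrates how a pretrained spectrogram backbone slots into our multi-label classification setup. The name load_pretrained_ssamba, the embedding size, and the optimizer values are placeholders rather than the exact settings from the SSAMBA paper.

```python
import torch
import torch.nn as nn

class BirdClassifier(nn.Module):
    """Wrap a pretrained spectrogram encoder with a per-species classification head."""
    def __init__(self, backbone, embed_dim, num_species):
        super().__init__()
        self.backbone = backbone               # pretrained encoder (SSAMBA, EfficientNet, ...)
        self.head = nn.Linear(embed_dim, num_species)

    def forward(self, spec):                   # spec: (batch, n_mels, time)
        emb = self.backbone(spec)              # assumed to return a pooled embedding
        return self.head(emb)                  # per-species logits

# Hypothetical usage -- load_pretrained_ssamba stands in for the actual checkpoint loader.
# model = BirdClassifier(load_pretrained_ssamba(), embed_dim=768, num_species=182)
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=5e-7)
# criterion = nn.BCEWithLogitsLoss()           # multi-label: a clip may contain several species
```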
Background. In addition to exploring additional architectures, we also consider alternative representations of audio data. As noted above, a common approach is to compute the spectrogram of an audio clip and then treat it as an RGB image; in particular, this means learning with architectures designed for natural images. While these architectures have been shown to recover perceptually salient features such as edges and textures in natural images, it is less clear whether analogous features are as prominent in spectrograms.
Prior work. We adapted the expand-and-sparsify representation introduced by Dasgupta and Tosh (2020). In their paper, they study a simplified neural architecture initially found in the fruit fly's olfactory system. In the fly brain, representations of olfactory stimuli are constructed as follows: a fly has around 50 types of receptors that detect different components of odors. When exposed to a stimulus, the corresponding activations form an initial dense representation of the odor, which we can model as a vector in 50 dimensions. In the expand step, this initial representation is randomly projected into a much higher-dimensional space. This is followed by a sparsify step, where only the top-k activations are kept, yielding a k-sparse, high-dimensional representation. It was shown that such a representation can disentangle non-linear features, so that it becomes possible to learn by training only a final linear layer. Here, we ask whether such a representation can be beneficial for acoustic data as well.
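As a concrete illustration, the sketch below implements the expand-and-sparsify map as described above: a fixed random projection into a much higher dimension followed by keeping only the top-k activations. The dimensions and the Gaussian projection are illustrative choices, not the exact construction from the original paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def expand_and_sparsify(x, proj, k):
    """x: (d,) dense input; proj: (m, d) fixed random projection with m >> d."""
    expanded = proj @ x                      # expand step: m-dimensional activations
    out = np.zeros_like(expanded)
    top = np.argsort(expanded)[-k:]          # indices of the k largest activations
    out[top] = expanded[top]                 # sparsify step: keep only the top-k
    return out

d, m, k = 50, 2000, 50                       # e.g. ~50 receptor types, expanded 40x
proj = rng.standard_normal((m, d))
sparse_code = expand_and_sparsify(rng.standard_normal(d), proj, k)
```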
Our work. In order to adapt this method to time-series data, we implemented a convolutional version of the expand-and-sparsify architecture. To briefly describe this method, suppose we have a Mel spectrogram, which is a multi-dimensional time series; that is, an array of size (n_mels, duration). We can then construct a kernel of size (n_output, n_mels, window) which scans along the spectrogram (here window is smaller than the audio clip's duration). After taking the convolution, we obtain a new time series with dimension n_output, which can be passed to downstream models.
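A minimal sketch of this convolutional expand step is shown below, with a top-k sparsification applied per time frame; the values of n_output, window, and k are illustrative rather than our exact configuration.

```python
import torch
import torch.nn as nn

class ConvExpandSparsify(nn.Module):
    def __init__(self, n_mels=128, n_output=2048, window=9, k=64):
        super().__init__()
        # Kernel of shape (n_output, n_mels, window) scanning along the time axis.
        self.expand = nn.Conv1d(n_mels, n_output, kernel_size=window,
                                padding=window // 2, bias=False)
        self.k = k

    def forward(self, spec):                                  # spec: (batch, n_mels, duration)
        z = self.expand(spec)                                 # (batch, n_output, duration)
        thresh = z.topk(self.k, dim=1).values[:, -1:, :]      # k-th largest activation per frame
        return torch.where(z >= thresh, z, torch.zeros_like(z))  # keep only the top-k per frame

rep = ConvExpandSparsify()(torch.randn(2, 128, 313))          # roughly 5 s of mel frames
```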
Experimental method. To evaluate this method, we built a small residual network for time-series data. We can think of a time-series object as having two axes: a spatial axis and a temporal axis. Consider a d-dimensional time series of length T, which is an array of shape (d, T). We design the resnet to have convolutional kernels that span all spatial dimensions at once, so that each kernel has shape (d, T_window). We trained two such resnets: (a) the baseline takes the Mel spectrogram directly, while (b) the experimental version takes the expand-and-sparsify representation.
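For reference, a residual block of this kind might look like the sketch below, where each 1-D convolution spans all d channels and a window along time; the layer sizes are illustrative, not the exact network we trained.

```python
import torch
import torch.nn as nn

class ResBlock1d(nn.Module):
    def __init__(self, d, t_window=5):
        super().__init__()
        # Each kernel covers all d channels and t_window time steps, i.e. shape (d, t_window).
        self.conv1 = nn.Conv1d(d, d, kernel_size=t_window, padding=t_window // 2)
        self.conv2 = nn.Conv1d(d, d, kernel_size=t_window, padding=t_window // 2)
        self.norm1 = nn.BatchNorm1d(d)
        self.norm2 = nn.BatchNorm1d(d)

    def forward(self, x):                          # x: (batch, d, T)
        h = torch.relu(self.norm1(self.conv1(x)))
        h = self.norm2(self.conv2(h))
        return torch.relu(x + h)                   # residual connection

# Baseline (a) feeds the mel spectrogram (d = n_mels) directly;
# the experimental version (b) feeds the expand-and-sparsify output (d = n_output).
y = ResBlock1d(d=128)(torch.randn(2, 128, 313))
```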
Results. While the hope was that the expand-and-sparsify representation could speed up training by disentangling non-linear features, it does not appear to make learning easier. We submitted these two models to the BirdCLEF competition after 2 epochs of training. The baseline (a) received a score of 0.58, while the experimental version (b) received a score of 0.48, no better than random. It therefore seems that this representation, together with our choice of architecture, actually made learning more difficult.
Additional discussion. We continued training the baseline (a) for 40 epochs, and it received a slightly improved score of 0.60 on BirdCLEF. On the one hand, this is worse than a comparably trained EfficientNet, which scored 0.64. On the other hand, this baseline is smaller (48 MB vs. 74 MB) and runs much faster in the CPU-only setting required by the BirdCLEF competition (20 min vs. 80 min). This suggests that it may be worth exploring additional resnet architectures that treat the spectrogram as a time series rather than as a natural image. Finally, the poor results for the expand-and-sparsify representation do not necessarily rule it out; a more robust set of experiments is required, and there are further modifications suggested in the original paper that we could also try.
We use ensemble learning methods to enhance the accuracy of our bird species identification project, which is part of the BirdCLEF competition focusing on identifying bird sounds from complex audio environments. This competition requires analyzing diverse audio data, including noisy field recordings with overlapping bird calls. Our approach combines several advanced machine learning models, each trained independently, to better predict bird species. Specifically, we utilized models such as tf_efficientnet_b4 and resnetv2_101, achieving our highest score of 0.66 with the latter after 50 training epochs. Other models such as eca_nfnet_l0 and mobilenetv3_large_100_miil_in21k scored 0.62 and 0.60, respectively, demonstrating robust performance across varied audio datasets. We are also currently evaluating an enhanced ensemble strategy that combines two models, mobilenetv3_large_100_miil_in21k and resnetv2_50, exported to the ONNX format to improve inference efficiency. Together, mobilenetv3_large_100_miil_in21k and resnetv2_50 achieved a combined score of 0.59.
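The sketch below illustrates the two pieces of this strategy: averaging the per-species probabilities of independently trained models, and exporting a trained model to ONNX for faster CPU-only inference at submission time. The opset version, input shape, and species count are illustrative.

```python
import torch
import timm

NUM_SPECIES = 182  # illustrative; set to the size of the species label set

def ensemble_probs(models, spec):
    """Average the sigmoid outputs of independently trained models on one batch."""
    with torch.no_grad():
        probs = [torch.sigmoid(m(spec)) for m in models]
    return torch.stack(probs).mean(dim=0)

# Export a trained model to ONNX so it can run through an ONNX runtime on CPU.
model = timm.create_model("resnetv2_50", num_classes=NUM_SPECIES).eval()
dummy = torch.randn(1, 3, 224, 224)                 # spectrogram treated as a 3-channel image
torch.onnx.export(model, dummy, "resnetv2_50.onnx", opset_version=17,
                  input_names=["spec"], output_names=["logits"])
```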
However, not all attempts were successful; models such as seresnext101_32x4d and eca_nfnet_l2 timed out, likely due to their large size, which made inference within Kaggle's time constraints challenging. Participating on Kaggle proved difficult due to several limitations, including no internet access during submission, an unorganized file system, and the lack of GPU access at inference time, which significantly hindered our ability to utilize more computationally intensive models effectively.
Despite these challenges, our team placed an impressive 80th out of 780 teams, earning a bronze medal in the competition. This accomplishment underscores the effectiveness of our ensemble strategy in handling the complex task of bird species identification from audio data, even within the restrictive environment of a competitive Kaggle contest.