TY - JOUR
T1 - ATGNN: Audio Tagging Graph Neural Network
AU - Singh, Shubhr
AU - Steinmetz, Christian J.
AU - Benetos, Emmanouil
AU - Phan, Huy
AU - Stowell, Dan
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
AB - Deep learning models such as CNNs and Transformers have achieved impressive performance for end-to-end audio tagging. Recent works have shown that despite stacking multiple layers, the receptive field of CNNs remains severely limited. Transformers, on the other hand, are able to model global context through self-attention, but treat the spectrogram as a sequence of patches, which is not flexible enough to capture irregular audio objects. In this letter, we treat the spectrogram in a more flexible way by considering it as a graph structure and process it with a novel graph neural architecture called ATGNN. ATGNN not only combines the local feature extraction capability of CNNs with the global information sharing ability of Graph Neural Networks, but also maps semantic relationships between learnable class embeddings and corresponding spectrogram regions. We evaluate ATGNN on two audio tagging tasks, where it achieves 0.585 mAP on the FSD50K dataset and 0.335 mAP on the AudioSet-balanced dataset, results comparable to those of Transformer-based models with a significantly smaller number of learnable parameters.
KW - Audio tagging
KW - computational sound scene analysis
KW - graph neural networks
UR - http://www.scopus.com/inward/record.url?scp=85182937250&partnerID=8YFLogxK
DO - 10.1109/LSP.2024.3352514
M3 - Article
VL - 31
SP - 825
EP - 829
JO - IEEE Signal Processing Letters
JF - IEEE Signal Processing Letters
ER -