
MusiQAl: A Dataset for Music Question–Answering through Audio–Video Fusion

Open Access | Jul 2025

Figures & Tables

Figure 1

Overview of the music question–answering system. Features from input data (video, audio, and text) are processed and spatiotemporally mapped (grounded) through the AVST/LAVISH model to generate an answer (A).
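To make the Figure 1 pipeline concrete, here is a minimal PyTorch sketch of an audio–video question-answering forward pass, assuming precomputed per-frame features. All module names, feature dimensions, and the single cross-attention fusion step are illustrative assumptions; the actual AVST and LAVISH architectures perform considerably richer spatiotemporal grounding.

import torch
import torch.nn as nn

# Toy audio-video QA model: the question attends over concatenated
# audio and video features, and the attended summary is classified
# into a fixed answer vocabulary. Dimensions are arbitrary choices.
class ToyAVQA(nn.Module):
    def __init__(self, d=256, n_answers=42):
        super().__init__()
        self.audio_proj = nn.Linear(128, d)   # e.g., per-frame mel features
        self.video_proj = nn.Linear(512, d)   # e.g., per-frame CNN features
        self.text_proj = nn.Linear(300, d)    # e.g., pooled question embedding
        self.fusion = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(d, n_answers)

    def forward(self, audio, video, question):
        # audio: (B, Ta, 128), video: (B, Tv, 512), question: (B, 300)
        av = torch.cat([self.audio_proj(audio), self.video_proj(video)], dim=1)
        q = self.text_proj(question).unsqueeze(1)    # (B, 1, d)
        grounded, _ = self.fusion(q, av, av)         # question attends to A+V
        return self.classifier(grounded.squeeze(1))  # logits over answers (A)

model = ToyAVQA()
logits = model(torch.randn(2, 60, 128), torch.randn(2, 16, 512), torch.randn(2, 300))
print(logits.shape)  # torch.Size([2, 42])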

Figure 2

Distribution of singing (left) and dance (right) styles by geographic origin. Styles are listed on the x‑axes, with the y‑axes showing the corresponding percentage (%) on a linear scale. Bar colors denote geographic origin (see legend). The style taxonomies and regional annotations presented were curated by the authors for this study.

Table 1

The instrumental categories in MusiQAl.

Strings: Banjo, Bass, Cello, Double Bass, Cretan Lyra, Fiddle, Guitar, Harp, Koto, Laouto, Outi, Qanun, Sitar, Violin
Winds: Bassoon, Clarinet, Euphonium, Flute, Horn, Pipe, Saxophone, Trombone, Trumpet, Tsabouna, Uilleann Pipe, Whistle
Percussion: Bell, Castanets, Djembe, Drum, Drums, Dundun, Glockenspiel, Kenong, Saron, Shaker, Tabla, Taiko, Tambourine, Tympani
Keyboards: Accordion, Piano, Synthesizer
Other: Computer, Steps, Voice, Clap
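For convenience, the reconstructed Table 1 taxonomy can be expressed directly as a Python mapping; the lookup helper below is an illustrative addition, not part of the dataset's tooling.

# Instrument taxonomy transcribed from Table 1.
INSTRUMENT_CATEGORIES = {
    "Strings": ["Banjo", "Bass", "Cello", "Double Bass", "Cretan Lyra", "Fiddle",
                "Guitar", "Harp", "Koto", "Laouto", "Outi", "Qanun", "Sitar", "Violin"],
    "Winds": ["Bassoon", "Clarinet", "Euphonium", "Flute", "Horn", "Pipe",
              "Saxophone", "Trombone", "Trumpet", "Tsabouna", "Uilleann Pipe", "Whistle"],
    "Percussion": ["Bell", "Castanets", "Djembe", "Drum", "Drums", "Dundun",
                   "Glockenspiel", "Kenong", "Saron", "Shaker", "Tabla", "Taiko",
                   "Tambourine", "Tympani"],
    "Keyboards": ["Accordion", "Piano", "Synthesizer"],
    "Other": ["Computer", "Steps", "Voice", "Clap"],
}

def category_of(instrument: str) -> str | None:
    """Return the Table 1 category for an instrument name, if any."""
    for category, members in INSTRUMENT_CATEGORIES.items():
        if instrument in members:
            return category
    return None

print(category_of("Cretan Lyra"))  # Strings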
Figure 3

Data distribution by performance type in MusiQAl.

Figure 4

Overview of singing styles and dance types in MusiQAl, along with their countries of origin. The symbol in the center of the map denotes global data.

Table 2

Overview of the question categories present in four former music question–answering datasets compared to MusiQAl.

Dataset | # Questions
MusicQA (Liu et al., 2024a) | 112K
MuChoMusic (Weck et al., 2024) | 1.1K
MUSIC‑AVQA (Li et al., 2022) | 45K
MUSIC‑AVQA‑v2.0 (Liu et al., 2024b) | 53K
MusiQAl | 12K

Category columns in the original table: Existential, Location, Counting, Comparing, Temporal, Causal, and Purpose questions, each posed over audio (A), video (V), or audio–video (AV); the per-dataset coverage marks are not reproduced here.

[i] The questions can be based on audio (A), video (V), or both audio and video (AV).

Figure 5

Distribution of question categories in different scenarios. Question counts (y‑axis) are shown per category (x‑axis) for each modality (audio–video, audio‑only, and video‑only).
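A tally like the one plotted in Figure 5 could be produced with a short pandas aggregation; the inline rows and the column names "modality" and "category" are hypothetical stand-ins for MusiQAl's actual annotation schema.

import pandas as pd

# Hypothetical annotation rows; MusiQAl's real schema may differ.
qa = pd.DataFrame({
    "modality": ["AV", "AV", "A", "V", "A"],
    "category": ["Counting", "Causal", "Existential", "Location", "Counting"],
})

# Count questions per (modality, category) pair, as in Figure 5.
counts = qa.groupby(["modality", "category"]).size().unstack(fill_value=0)
print(counts)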

Figure 6

N‑gram analysis of question formulation. This plot shows the frequency of common word sequences (up to four words) at the beginning of questions, revealing dominant patterns in question formulation.
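The kind of leading n-gram count shown in Figure 6 can be sketched in a few lines: for each question, tally its first one to four words. The sample questions below are invented stand-ins, not items from MusiQAl.

from collections import Counter

questions = [
    "How many instruments are playing?",
    "How many dancers appear in the video?",
    "Is the singing style polyphonic?",
]

# Tally every prefix of up to four words at the start of each question.
prefix_counts = Counter()
for q in questions:
    words = q.lower().rstrip("?").split()
    for n in range(1, min(4, len(words)) + 1):
        prefix_counts[" ".join(words[:n])] += 1

for prefix, count in prefix_counts.most_common(5):
    print(f"{count}  {prefix}")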

Figure 7

Question taxonomy. This visual representation highlights the hierarchical structure and the interconnections between different question types.

Table 3

Performance comparison of the AVST and LAVISH models on the MusiQAl dataset.

Audio (A)
Model | Existential | Location | Counting | Comparative | Temporal | Causal | Purpose | Avg
AVST (mean) | 76.11 | N/A | 69.79 | 72.65 | 76.37 | 80.00 | N/A | 72.32
AVST (std) | 1.82 | N/A | 2.32 | 1.00 | 3.29 | 0.00 | N/A | 1.02
LAVISH (mean) | 55.00 | N/A | 77.46 | 68.77 | 52.90 | 76.00 | N/A | 66.03
LAVISH (std) | 2.38 | N/A | 1.01 | 1.47 | 4.33 | 8.94 | N/A | 2.03

Video (V)
Model | Existential | Location | Counting | Comparative | Temporal | Causal | Purpose | Avg
AVST (mean) | 94.55 | 69.08 | 68.75 | N/A | 73.41 | 63.51 | N/A | 69.48
AVST (std) | 4.98 | 1.55 | 1.71 | N/A | 2.06 | 2.30 | N/A | 0.87
LAVISH (mean) | 77.74 | 61.45 | 74.91 | N/A | 69.63 | 60.00 | N/A | 69.60
LAVISH (std) | 3.50 | 0.69 | 1.69 | N/A | 4.65 | 3.12 | N/A | 1.41

Audio–Video (AV)
Model | Existential | Location | Counting | Comparative | Temporal | Causal | Purpose | Avg | AvgAll
AVST (mean) | 61.19 | 66.10 | 69.05 | 73.68 | 70.55 | 64.40 | 65.97 | 66.23 | 68.21
AVST (std) | 0.88 | 1.11 | 1.21 | 3.44 | 1.40 | 2.44 | 1.40 | 1.14 | 0.95
LAVISH (mean) | 80.80 | 63.83 | 64.42 | 82.37 | 50.00 | 76.99 | 73.23 | 70.08 | 69.78
LAVISH (std) | 0.38 | 1.51 | 0.55 | 0.71 | 1.40 | 1.63 | 1.38 | 0.27 | 0.48

[i] The results are detailed in three tables, corresponding to audio‑only (A), video‑only (V), and combined audio–video (AV) question categories. All values represent answer prediction accuracy in percentage (%). For each model and category, the '(mean)' row reports the mean accuracy over five independent runs, and the '(std)' row reports the standard deviation, quantifying the performance variability across these runs. Both models were trained on human‑annotated question–answer pairs. In the original table, the higher score per category between the two models is highlighted in bold. 'Avg' refers to the average accuracy across all question categories within a specific modality (A, V, or AV), while 'AvgAll' represents the overall average across all three modalities.
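As a small illustration of the aggregation behind Table 3, the NumPy snippet below reduces one category's five run accuracies to a mean and standard deviation. The run values are invented, and the use of the sample standard deviation (ddof=1) is an assumption; the paper does not state which estimator it uses.

import numpy as np

# Five per-run accuracies (%) for one model/category; values are made up.
runs = np.array([76.3, 74.1, 77.9, 75.8, 76.4])
print(f"mean={runs.mean():.2f}  std={runs.std(ddof=1):.2f}")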

Figure 8

Accuracy for each question category (Causal, Comparative, Counting, Existential, Location, Purpose, Temporal) by modality (audio (A), audio–video (AV), or video (V)).
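A Figure 8-style view can be approximated by arranging the Table 3 means into a modality-by-category grid. The sketch below does this for four AVST categories; the values are copied from Table 3, while the tabular layout (rather than the paper's plotting code) is illustrative.

import pandas as pd

# AVST mean accuracies (%) from Table 3, arranged modality x category.
avst_mean = pd.DataFrame(
    {
        "Existential": [76.11, 94.55, 61.19],
        "Counting": [69.79, 68.75, 69.05],
        "Temporal": [76.37, 73.41, 70.55],
        "Causal": [80.00, 63.51, 64.40],
    },
    index=["A", "V", "AV"],
)
print(avst_mean)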

DOI: https://doi.org/10.5334/tismir.222 | Journal eISSN: 2514-3298
Language: English
Submitted on: Sep 1, 2024
Accepted on: May 30, 2025
Published on: Jul 31, 2025
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2025 Anna-Maria Christodoulou, Kyrre Glette, Olivier Lartillot, Alexander Refsum Jensenius, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.