References
- Agostinelli, A., Denk, T. I., Borsos, Z., Engel, J., Verzetti, M., Caillon, A., Huang, Q., Jansen, A., Roberts, A., Tagliasacchi, M., Sharifi, M., Zeghidour, N., and Frank, C. (2023). MusicLM: Generating music from text. arXiv:2301.11325.
- Akiba, T., Sano, S., Yanase, T., Ohta, T., and Koyama, M. (2019). Optuna: A next‑generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’19, New York, NY, USA, pp. 2623–2631. Association for Computing Machinery.
- Bader, R. (2018). Musical instruments as synchronized systems. In R. Bader (Ed.), Springer Handbook of Systematic Musicology (pp. 171–196). Springer.
- Bregman, A. S. (1990). Auditory Scene Analysis: The Perceptual Organization of Sound. The MIT Press.
- Cheng, Z., Leng, S., Zhang, H., Xin, Y., Li, X., Chen, G., Zhu, Y., Zhang, W., Luo, Z., Zhao, D., and Bing, L. (2024). VideoLLaMA 2: Advancing spatial‑temporal modeling and audio understanding in video‑LLMs. arXiv:2406.07476 [cs].
- Christodoulou, A.‑M., Lartillot, O., and Jensenius, A. R. (2024). Multimodal music datasets? Challenges and future goals in music processing. International Journal of Multimedia Information Retrieval, 13(3), 37.
- Christodoulou, A.‑M., Lartillot, O., and Anagnostopoulou, C. (2022). Computational analysis of the Greek folk music of the Aegean islands [Master’s thesis]. National and Kapodistrian University of Athens.
- Clarke, E. (2004). Empirical methods in the study of performance. In E. Clarke and N. Cook (Eds.), Empirical Musicology: Aims, Methods, Prospects. Oxford University Press.
- Clarke, E. F. (2005). Perception, ecology, and music. In E. F. Clarke (Ed.), Ways of Listening: An Ecological Approach to the Perception of Musical Meaning. Oxford University Press.
- Clayton, M. (2008). Theoretical perspectives II: General theories of rhythm and metre. In M. Clayton (Ed.), Time in Indian Music: Rhythm, Metre, and Form in North Indian Rag Performance. Oxford University Press.
- Cook, N. (1998). Introduction: Music and meaning in the commercials. In N. Cook (Ed.), Analysing Musical Multimedia. Oxford University Press.
- Dahl, S., and Friberg, A. (2007). Visual perception of expressiveness in musicians’ body movements. Music Perception, 24(5), 433–454. doi:10.1525/mp.2007.24.5.433.
- Davidson, J. W. (1993). Visual perception of performance manner in the movements of solo musicians. Psychology of Music, 21(2), 103–113.
- Davidson, J. W. (2005). Bodily communication in musical performance. In D. Miell, R. MacDonald, and D. J. Hargreaves (Eds.), Musical Communication. Oxford University Press.
- de Valk, R., Volk, A., Holzapfel, A., Pikrakis, A., Kroher, N., and Six, J. (2017). MIRchiving: Challenges and opportunities of connecting MIR research and digital music archives. In Proceedings of the 4th International Workshop on Digital Libraries for Musicology, DLfM ’17, New York, NY, USA, pp. 25–28. Association for Computing Machinery.
- Duan, H., Xia, Y., Zhou, M., Tang, L., Zhu, J., and Zhao, Z. (2023). Cross‑modal prompts: Adapting large pre‑trained models for audio‑visual downstream tasks. arXiv:2311.05152 [cs].
- Gao, W., Li, X., Jin, C., and Tie, Y. (2022). Music question answering: Cognize and perceive music. In 2022 IEEE International Conference on Multimedia and Expo Workshops (ICMEW), pp. 1–6. IEEE.
- Gemmeke, J. F., Ellis, D. P. W., Freedman, D., Jansen, A., Lawrence, W., Moore, R. C., Plakal, M., and Ritter, M. (2017). Audio Set: An ontology and human‑labeled dataset for audio events. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 776–780. IEEE.
- Gibson, J. J. (1979). The Ecological Approach to Visual Perception: Classic Edition. Psychology Press.
- Godøy, R. I., and Jensenius, A. R. (2009). Body movement in music information retrieval. In Proceedings of the 10th International Society for Music Information Retrieval Conference, pp. 45–50. ISMIR.
- Godøy, R. I., and Leman, M. (Eds.). (2009). Musical Gestures: Sound, Movement, and Meaning. Routledge.
- Gritten, A. (2006). Music and Gesture. Routledge.
- Gritten, A., and King, E. (2011). New Perspectives on Music and Gesture. Ashgate Publishing.
- Haugen, M. R. (2016). Music‑dance: Investigating rhythm structures in Brazilian samba and Norwegian telespringar performance [Doctoral thesis]. University of Oslo.
- He, K., Zhang, X., Ren, S., and Sun, J. (2015). Deep residual learning for image recognition. arXiv:1512.03385 [cs].
- Høffding, S., Bishop, L., Burnim, K., Clim, M.‑A., Good, M., Hansen, N. C., Lartillot, O., Martin, R., Nielsen, N., and Rosas, F. (2020). MusicLab Copenhagen Dataset. OSF.
- Huang, R., Holzapfel, A., Sturm, B., and Kaila, A.‑K. (2023). Beyond diverse datasets: Responsible MIR, interdisciplinarity, and the fractured worlds of music. Transactions of the International Society for Music Information Retrieval, 6(1), 43–59.
- Huron, D. (2008). Sweet Anticipation: Music and the Psychology of Expectation. MIT Press.
- Jensenius, A. R. (2007). Action–Sound: Developing methods and tools to study music‑related body movement [PhD thesis]. University of Oslo.
- Lacey, S., Nguyen, J., Schneider, P., and Sathian, K. (2020). Crossmodal visuospatial effects on auditory perception of musical contour. Multisensory Research, 34(2), 113–127.
- Law, E., West, K., and Mandel, M. (2009). Evaluation of algorithms using games: The case of music tagging. In Proceedings of the 10th International Society for Music Information Retrieval Conference, pp. 387–392. ISMIR.
- Leman, M. (2007). Embodied Music Cognition and Mediation Technology. MIT Press.
- Li, G., Wei, Y., Tian, Y., Xu, C., Wen, J.‑R., and Hu, D. (2022). Learning to answer questions in dynamic audio‑visual scenarios. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, pp. 19086–19096. IEEE.
- Lin, Y.‑B., Sung, Y.‑L., Lei, J., Bansal, M., and Bertasius, G. (2023). Vision transformers are parameter‑efficient audio‑visual learners. arXiv:2212.07983 [cs, eess].
- Liu, S., Hussain, A. S., Sun, C., and Shan, Y. (2024a). Music understanding LLaMA: Advancing text‑to‑music generation with question answering and captioning. In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 286–290. IEEE.
- Liu, X., Dong, Z., and Zhang, P. (2024b). Tackling data bias in MUSIC‑AVQA: Crafting a balanced dataset for unbiased question‑answering. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 4478–4487. IEEE.
- Manco, I., Benetos, E., Quinton, E., and Fazekas, G. (2021). MusCaps: Generating captions for music audio. In 2021 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. arXiv:2104.11984 [cs, eess].
- McAdams, S., Depalle, P., and Clarke, E. (2004). Analyzing musical sound. In E. Clarke and N. Cook (Eds.), Empirical Musicology: Aims, Methods, Prospects. Oxford University Press.
- Moura, N., Fonseca, P., Goethel, M., Oliveira‑Silva, P., Vilas‑Boas, J. P., and Serra, S. (2023). The impact of visual display of human motion on observers’ perception of music performance. PLoS ONE, 18(10), Article e0292387.
- Müller, M., Konz, V., Clausen, M., Ewert, S., and Fremerey, C. (2010). A multimodal way of experiencing and exploring music. Interdisciplinary Science Reviews, 35(2), 138–153.
- Patel, A. D. (2007). Afterword. In A. D. Patel (Ed.), Music, Language, and the Brain. Oxford University Press.
- Pilkov, I., Bishop, L., and Cancino‑Chacón, C. (2024). Hanon hands dataset. doi:10.5281/zenodo.12810303 (In preparation).
- Platz, F., and Kopiez, R. (2012). When the eye listens: A meta‑analysis of how audio‑visual presentation enhances the appreciation of music performance. Music Perception, 30(1), 71–83.
- Polak, R., Tarsitani, S., and Clayton, M. (2018). IEMP Malian Jembe. OSF.
- Schutz, M. (2008). Seeing music? What musicians need to know about vision. Empirical Musicology Review, 3(3), 83–108.
- Shimojo, S., and Shams, L. (2001). Sensory modalities are not separate modalities: Plasticity and interactions. Current Opinion in Neurobiology, 11(4), 505–509.
- Tagg, P. (2012). Music’s Meanings: A Modern Musicology for Non‑Musos. Mass Media Music Scholars’ Press.
- Gemini Team (2024). Gemini: A family of highly capable multimodal models. arXiv:2312.11805 [cs].
- Tsay, C.‑J. (2013). Sight over sound in the judgment of music performance. Proceedings of the National Academy of Sciences, 110(36), 14580–14585.
- Tzirakis, P., Trigeorgis, G., Nicolaou, M. A., Schuller, B. W., and Zafeiriou, S. (2017). End‑to‑end multimodal emotion recognition using deep neural networks. IEEE Journal of Selected Topics in Signal Processing, 11(8), 1301–1309.
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2023). Attention is all you need. arXiv:1706.03762 [cs].
- Vines, B. W., Krumhansl, C. L., Wanderley, M. M., Dalca, I. M., and Levitin, D. J. (2011). Music to my eyes: Cross‑modal interactions in the perception of emotions in musical performance. Cognition, 118(2), 157–170.
- Wang, K., Tian, Y., and Hatzinakos, D. (2024). Towards efficient audio‑visual learners via empowering pre‑trained vision transformers with cross‑modal adaptation. In CVPR Workshops. IEEE.
- Weck, B., Manco, I., Benetos, E., Quinton, E., Fazekas, G., and Bogdanov, D. (2024). MuChoMusic: Evaluating music understanding in multimodal audio‑language models. arXiv:2408.01337 [cs, eess].
- Yang, P., Wang, X., Duan, X., Chen, H., Hou, R., Jin, C., and Zhu, W. (2022). AVQA: A dataset for audio‑visual question answering on videos. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, pp. 3480–3491. ACM.
