
Reductive, Exclusionary, Normalising: The Limits of Generative AI Music

Open Access
| Sep 2025

References

  1. Agostinelli, A., Denk, T. I., Borsos, Z., Engel, J., Verzetti, M., Caillon, A., Huang, Q., Jansen, A., Roberts, A., Tagliasacchi, M., Sharifi, M., Zeghidour, N., and Frank, C. (2023). MusicLM: Generating music from text. arXiv:2301.11325 [cs, eess].
  2. Alvarado, R. (2023). AI as an epistemic technology. Science and Engineering Ethics, 29(5), 32.
  3. Austin, L., Cage, J., and Hiller, L. (1992). An interview with John Cage and Lejaren Hiller. Computer Music Journal, 16(4), 15.
  4. Bajohr, H. (2024). Whoever Controls Language Models Controls Politics (pp. 189–195). Walther König.
  5. Barad, K. (2007). Meeting the Universe Halfway: Quantum Physics and the Entanglement of Matter and Meaning (p. 542). Duke University Press.
  6. Baudrillard, J. (1994). Simulacra and Simulation. University of Michigan Press.
  7. Born, G. (2010). For a relational musicology: Music and interdisciplinarity, beyond the practice turn: The 2007 Dent Medal Address. Journal of the Royal Musical Association, 135(2), 205–243.
  8. Born, G. (2020). Diversifying MIR: Knowledge and real‑world challenges, and new interdisciplinary futures. Transactions of the International Society for Music Information Retrieval, 3(1), 193–204.
  9. Born, G. (2021). Artificial Intelligence, Music Recommendation, and the Curation of Culture: A White Paper (p. 27). Schwartz Reisman Institute for Technology and Society, University of Toronto.
  10. Borsos, Z., Marinier, R., Vincent, D., Kharitonov, E., Pietquin, O., Sharifi, M., Roblek, D., Teboul, O., Grangier, D., Tagliasacchi, M., and Zeghidour, N. (2023). AudioLM: A language modeling approach to audio generation. arXiv:2209.03143 [cs].
  11. Campagna, F. (2018). Technic and Magic: The Reconstruction of Reality. Bloomsbury.
  12. Campagna, F. (2021). Prophetic Culture (pp. 1–280). Bloomsbury Publishing.
  13. Chomsky, N. (1957). Syntactic Structures. Mouton & Co.
  14. Chung, Y.‑A., Zhang, Y., Han, W., Chiu, C.‑C., Qin, J., Pang, R., and Wu, Y. (2021). W2v‑BERT: Combining contrastive learning and masked language modeling for self‑supervised speech pre‑training. arXiv:2108.06209 [cs].
  15. Cope, D. (1992). A computer model of music composition. In Machine Models of Music (pp. 403–425).
  16. Copet, J., Kreuk, F., Gat, I., Remez, T., Kant, D., Synnaeve, G., Adi, Y., and Défossez, A. (2024). Simple and controllable music generation. arXiv:2306.05284 [cs, eess].
  17. Derrida, J. (1976). Of Grammatology (1st American ed.). Johns Hopkins University Press.
  18. Desai, J., Watson, D., Wang, V., Taddeo, M., and Floridi, L. (2022). The epistemological foundations of data science: A critical analysis. SSRN Electronic Journal.
  19. Ebcioğlu, K. (1990). An expert system for harmonizing chorales in the style of J.S. Bach. The Journal of Logic Programming, 8(1–2), 145–185.
  20. Elizalde, B., Deshmukh, S., Ismail, M. A., and Wang, H. (2022). CLAP: Learning audio concepts from natural language supervision. arXiv:2206.04769 [cs, eess].
  21. Espeland, W., and Sauder, M. (2007). Rankings and reactivity: How public measures recreate social worlds. American Journal of Sociology, 113(1), 140.
  22. Evans, Z., Parker, J. D., Carr, C. J., Zukowski, Z., Taylor, J., and Pons, J. (2024). Long‑form music generation with latent diffusion. arXiv:2404.10301 [cs, eess].
  23. Fazi, M. B. (2021). Beyond human: Deep learning, explainability and representation. Theory, Culture & Society, 38(7–8), 55–77.
  24. Foucault, M. (1995). Discipline and Punish: The Birth of the Prison (2nd Vintage Books ed.). Vintage Books.
  25. Fux, J. J., Mann, A., and Edmunds, J. (1971). The study of counterpoint from Johann Joseph Fux’s Gradus ad Parnassum (Rev. ed., 32nd printing). W. W. Norton.
  26. Gemmeke, J. F., Ellis, D. P. W., Freedman, D., Jansen, A., Lawrence, W., Moore, R. C., Plakal, M., and Ritter, M. (2017). Audio Set: An ontology and human‑labeled dataset for audio events. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 776–780. IEEE.
  27. Giraud, E. H. (2019). What Comes After Entanglement? Activism, Anthropocentrism, and an Ethics of Exclusion. Duke University Press.
  28. Griffith, N., and Todd, P. M. (1999). Frankensteinian methods for evolutionary music composition. In Musical Networks. The MIT Press.
  29. Heidegger, M. (1927). Being and Time. Blackwell Publishing.
  30. Hiller, L., and Isaacson, L. M. (1979). Experimental Music: Composition with an Electronic Computer. Greenwood Press.
  31. Holzapfel, A., Sturm, B. L., and Coeckelbergh, M. (2018). Ethical dimensions of music information retrieval technology. Transactions of the International Society for Music Information Retrieval, 1(1), 44–55.
  32. Huang, A., Sturm, B. L. T., and Holzapfel, A. (2021). De‑centering the west: East Asian philosophies and the ethics of applying artificial intelligence to music. In Proceedings of the 22nd International Society for Music Information Retrieval Conference, ISMIR 2021, pp. 301–309.
  33. Huang, Q., Jansen, A., Lee, J., Ganti, R., Li, J. Y., and Ellis, D. P. W. (2022). MuLan: A joint embedding of music audio and natural language. arXiv:2208.12415 [cs, eess, stat].
  34. Huang, Q., Park, D. S., Wang, T., Denk, T. I., Ly, A., Chen, N., Zhang, Z., Zhang, Z., Yu, J., Frank, C., Engel, J., Le, Q. V., Chan, W., Chen, Z., and Han, W. (2023). Noise2Music: Text‑conditioned music generation with diffusion models. arXiv:2302.03917 [cs, eess].
  35. Husain, A. (2021). Replace Me. Peninsula Press.
  36. Jakobson, R. (1960). Linguistics and Poetics. Selected Writings III/Mouton.
  37. James, R. (2019). The Sonic Episteme: Acoustic Resonance, Neoliberalism, and Biopolitics. Duke University Press.
  38. Joque, J. (2022). Revolutionary Mathematics. Verso.
  39. Juslin, P. N., and Sloboda, J. A. (Eds.). (2010). Handbook of Music and Emotion: Theory, Research, Applications (p. xiv, 975). Oxford University Press.
  40. Kristeva, J. (1993). Revolution in Poetic Language (reprint ed.). Columbia University Press.
  41. Li, S., and Sung, Y. (2023). MelodyDiffusion: Chord‑conditioned melody generation using a transformer‑based diffusion model. Mathematics, 11(8), 1915.
  42. Liao, W., Takida, Y., Ikemiya, Y., Zhong, Z., Lai, C.‑H., Fabbro, G., Shimada, K., Toyama, K., Cheuk, K., Martínez‑Ramírez, M. A., Takahashi, S., Uhlich, S., Akama, T., Choi, W., Koyama, Y., and Mitsufuji, Y. (2024). Music foundation model as generic booster for music downstream tasks. arXiv:2411.01135 [cs].
  43. Ma, Y., Øland, A., Ragni, A., Del Sette, B. M., Saitis, C., Donahue, C., Lin, C., Plachouras, C., Benetos, E., Quinton, E., Shatri, E., Morreale, F., Zhang, G., Fazekas, G., Xia, G., Zhang, H., Manco, I., Huang, J., Guinot, J., … Wang, Z. (2024). Foundation models for music: A survey. arXiv:2408.14340 [cs, eess].
  44. Manco, I., Benetos, E., Quinton, E., and Fazekas, G. (2022). Contrastive audio‑language learning for music. arXiv:2208.12208 [cs].
  45. McCulloch, W. S., and Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics, 5(4), 115–133.
  46. McPherson, A., and Lepri, G. (2020). Beholden to our tools: Negotiating with technology while sketching digital instruments. In Proceedings of the International Conference on New Interfaces for Musical Expression, p. 6. Birmingham City University.
  47. Merleau‑Ponty, M. (1945). Phenomenology of Perception. Routledge.
  48. Morreale, F. (2021). Where does the buck stop? Ethical and political issues with AI in music creation. Transactions of the International Society for Music Information Retrieval, 4(1), 105–113.
  49. Morreale, F. (2025). Human subsumption in training datasets for music generation. In The Inner World of AI. Routledge.
  50. Morreale, F., Sharma, M., and Wei, I.‑C. (2023). Data collection in music generation training sets: A critical analysis. In Proceedings of the 24th International Society for Music Information Retrieval Conference, Milan, Italy.
  51. Morrison, L., and McPherson, A. (2024). Entangling entanglement: A diffractive dialogue on HCI and musical interactions. In Proceedings of the CHI Conference on Human Factors in Computing Systems, pp. 1–17. ACM.
  52. Offenhuber, D. (2024). Shapes and frictions of synthetic data. Big Data & Society, 11(2), Article 20539517241249390.
  53. van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv:1609.03499 [cs].
  54. Pasquinelli, M. (2023). The Eye of the Master: A Social History of Artificial Intelligence. Verso.
  55. Saussure, F. d. (1916). Cours de Linguistique Générale (Éd. critique [Reprint of the 1916 edition]). Payot.
  56. Schneider, F., Kamal, O., Jin, Z., and Schölkopf, B. (2023). Moûsai: Text‑to‑music generation with long‑context latent diffusion. arXiv:2301.11757 [cs].
  57. Schottstaedt, B. (1984). Automatic Species Counterpoint (No. 19). CCRMA, Department of Music, Stanford University.
  58. Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical Journal, 27(3), 379–423.
  59. Simondon, G. (2011). On the mode of existence of technical objects. Deleuze Studies, 5(3), 407–424.
  60. Simondon, G. (2020). Individuation in Light of Notions of Form and Information (Vol. 1). University of Minnesota Press.
  61. Stiegler, B. (1998). Technics and Time. Stanford University Press.
  62. Tao, Y., Viberg, O., Baker, R. S., and Kizilcec, R. F. (2024). Cultural bias and cultural alignment of large language models. PNAS Nexus, 3(9), 346.
  63. van den Oord, A., Dieleman, S., and Schrauwen, B. (2013). Deep content‑based music recommendation. In C. J. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems (Vol. 26). Curran Associates, Inc.
  64. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. In 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
  65. Wang, Z., Min, L., and Xia, G. (2024). Whole‑song hierarchical generation of symbolic music using cascaded diffusion models. arXiv:2405.09901 [cs].
  66. Watson, D. S. (2023). On the philosophy of unsupervised learning. Philosophy & Technology, 36(2), 28.
  67. Weatherby, L., and Justie, B. (2022). Indexical AI. Critical Inquiry, 48(2), 381–415.
  68. Wu, S.‑L., Donahue, C., Watanabe, S., and Bryan, N. J. (2023). Music ControlNet: Multiple time‑varying controls for music generation. arXiv:2311.07069 [cs, eess].
  69. Yang, L.‑C., Chou, S.‑Y., and Yang, Y.‑H. (2017). MidiNet: A convolutional generative adversarial network for symbolic‑domain music generation. arXiv:1703.10847 [cs].
  70. Yu, B., Lu, P., Wang, R., Hu, W., Tan, X., Ye, W., Zhang, S., Qin, T., and Liu, T.‑Y. (2022). Museformer: Transformer with fine‑ and coarse‑grained attention for music generation. arXiv:2210.10349 [cs].
  71. Zeghidour, N., Luebs, A., Omran, A., Skoglund, J., and Tagliasacchi, M. (2022). SoundStream: An end‑to‑end neural audio codec. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30, 495–507.
  72. Zhao, M., Zhong, Z., Mao, Z., Yang, S., Liao, W.‑H., Takahashi, S., Wakaki, H., and Mitsufuji, Y. (2024). OpenMU: Your Swiss Army knife for music understanding. arXiv:2410.15573 [cs].
DOI: https://doi.org/10.5334/tismir.256 | Journal eISSN: 2514-3298
Language: English
Submitted on: Feb 13, 2025
Accepted on: Aug 13, 2025
Published on: Sep 4, 2025
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2025 Fabio Morreale, Marco A. Martinez-Ramirez, Raul Masu, WeiHsiang Liao, Yuki Mitsufuji, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.