1 Introduction
There now exist numerous companies applying artificial intelligence (AI) to music generation, e.g., Aiva, Boomy, Riffusion, Suno, and Udio—not to mention several major corporations exploring similar directions, such as Google, Meta, Microsoft, OpenAI, Stability AI, and Spotify. AI music technologies are becoming more capable of creating music that can compete with that made by humans1 and at scales never seen before. AI is transforming the music industry landscape in major ways. According to a 2024 study commissioned by the musicians’ rights organizations GEMA and SACEM (Goldmedia, 2024), of their members surveyed demanded that AI music be clearly identified, reflecting calls for transparency in the music industry. The survey findings also indicated that of German and French music creators worry AI could make it difficult for them to sustain careers in music by pushing human‑made music to the margins.
The scale of AI music production raises concerns about the possibility of AI music dominating online catalogs, eventually rendering humans unable to compete in a music industry increasingly driven by automation (Sturm et al., 2024). Alex Mitchell (Boomy CEO) predicts AI could produce hundreds of billions of songs annually. He reflects on the implications of this growth, stating: ‘Whoever is there first and whoever is doing that—the spoils on the other side of that thing are just going to be ridiculous’ (Velardo, 2022). The practical implications are already evident. At one time, Spotify took steps to regulate AI‑generated content on its platform, but it now appears to be using AI music in its catalog, perpetuating issues of human–AI competition (Pelly, 2025). Additionally, content filtering enables listeners and platforms to maintain transparency, allowing audiences to be informed when they are listening to AI‑generated music (Holzapfel et al., 2018).
A critical challenge that is emerging, then, seems to be the automated detection of music generated by AI (herein termed ‘AI music’). This can be important for content‑identification systems, copyright enforcement, and music‑recommendation systems (Agostinelli et al., 2023). Some companies, such as Figaro,2 Deezer,3 and IRCAM Amplify,4 are responding to this challenge by developing systems for AI music detection. Similar challenges exist in the voice domain, where researchers have developed methods to detect voice cloning to prevent potential misuse like fraud and impersonation (Ren et al., 2024). Outside of the domain of sound, research is being conducted to detect AI‑generated text (Liang et al., 2023; Walters, 2023; Weber‑Wulff et al., 2023) and images (Nguyen et al., 2022).
This article explores the task of AI music detection. It shares fundamental similarities with two classic MIR tasks, composer identification and music auto‑tagging, but also presents unique challenges. Like composer identification, it involves analyzing musical features to determine the author of the music; however, for AI music, this could involve a heterogeneous collection of many different music styles, which can complicate identification. AI music detection could be seen as more akin to identifying a record label than a specific composer. AI detection can also be viewed as a type of music auto‑tagging, as done in the development of IRCAM Amplify’s AI Music Detector (Dauban, 2024). This tool treats AI music detection like identifying instruments or audio characteristics, such as timbre, aligning it with other classification tasks like genre or instrument recognition. These tasks use similar techniques, such as feature extraction, pattern recognition, and statistical inference methods. Additionally, AI music may have unique ‘fingerprints’ or ‘watermarks’ that could simplify detection when present; however, such extra‑musical information may be proprietary and provides an unreliable means for detection (Afchar et al., 2024).
To explore the task of AI music detection, we have collected a large dataset of music recordings generated by users of two popular AI music platforms: Suno and Udio. Contemporaneously with our work, Rahman et al. (2025) introduced the SONICS dataset, which focuses on detecting AI‑generated music that its authors created on the same platforms following a template‑based approach, i.e., using a large language model to generate prompts from lists of possibilities. In contrast, our dataset is directly sourced from music promoted by Suno and Udio on their homepages, resulting in a representative collection of AI‑generated music ‘in the wild’ (though the possibility exists that the SONICS dataset and our own partially overlap).
The contributions of this paper are threefold: (1) a novel dataset of AI‑generated music from Suno and Udio, capturing organic user interactions rather than controlled prompts (contrasting with SONICS (Rahman et al., 2025)); (2) experiments showing that the performance of state‑of‑the‑art systems for AI music detection is comparable to that of simpler methods (e.g., Contrastive Language–Audio Pretraining [CLAP] embeddings with support vector machines [SVMs]); and (3) insights into detection vulnerabilities (e.g., reliance on sampling rates and high‑frequency artifacts).
In the next section, we review AI music, the tasks of composer identification and auto‑tagging, and research in detecting AI content other than music. Section 3 describes our method. Section 4 describes how we constructed our dataset and the feature‑extraction methods we used. Section 5 explores our dataset, comparing various features and visualizing the embeddings to understand the characteristics and structure of the data. Section 6 discusses how we train and evaluate AI music detection systems and compares their performance with other systems. Section 7 presents results of experiments observing how the performances of these systems are sensitive to audio transformations. Sections 8 and 9 discuss our findings and present directions of future work. Finally, Section 10 provides a link to access our code for reproducibility, while Section 11 discusses the ethics of our work.
2 Background
2.1 AI music platforms
In this article, we consider AI music as music generated by platforms that market themselves as using AI to create music. Two of the most popular platforms for AI music are Suno5 and Udio,6 both offering various models and settings to generate audio recordings of full songs with natural‑sounding performances. The Suno platform provides two modes for generating music. In the ‘default mode,’ users can describe the song they want, e.g., ‘a futuristic K‑pop song about dancing all night long.’ The ‘custom mode’ gives users more control, e.g., by specifying lyrics and style. Suno also provides options to make instrumental songs, remove vocals, or upload an existing audio file for conditioning. Users can also interact with generated songs, e.g., extending them or reusing the corresponding prompt. Udio offers similar features, allowing users to generate music through text prompts, lyrics, and musical styles.
2.2 Related datasets and detection efforts
The SONICS dataset (Rahman et al., 2025) contains over k songs generated by Suno and Udio. These songs were created using structured prompts based on predefined lyrics and style templates. The dataset was designed to study how models handle long‑range temporal structures in music. However, because the songs were generated in controlled settings, they may not reflect organic user interactions. For instance, the dataset categorizes outputs as ‘Full Fake,’ ‘Mostly Fake,’ or ‘Half Fake’ based on prompt structures, which may not fully capture the variability of organic user interactions.
Rahman et al. (2025) introduced a novel machine learning architecture capable of handling long audio contexts, named the Spectro‑Temporal Tokens Transformer (SpecTTTra). This model splits the input spectrogram into two orthogonal components: temporal clips (along the time axis) and spectral clips (along the frequency axis). Each component is processed separately via linear projections to generate tokens, with positional embeddings added to encode temporal and spectral positions. These tokens are then fused and processed through a transformer encoder with global attention, enabling the model to capture long‑range dependencies across both time and frequency domains. Rahman et al. (2025) showed that SpecTTTra achieves higher F1 scores on long‑duration audio than other models while reducing memory usage and inference time.
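To make the tokenization scheme concrete, the following sketch illustrates our reading of the SpecTTTra description above in PyTorch. It is not the authors’ released implementation: the clip sizes, embedding dimension, encoder depth, and classification head are illustrative placeholders.

```python
import torch
import torch.nn as nn

class SpectroTemporalTokenizer(nn.Module):
    """Simplified sketch of SpecTTTra-style tokenization (illustrative only).

    The input spectrogram (freq_bins x time_frames) is sliced into temporal
    clips (along time) and spectral clips (along frequency); each clip is
    linearly projected to a token, positional embeddings are added, and the
    fused token sequence is processed by a transformer encoder.
    """

    def __init__(self, freq_bins=128, time_frames=1024,
                 t_clip=32, f_clip=16, dim=256, depth=4, heads=8):
        super().__init__()
        self.t_clip, self.f_clip = t_clip, f_clip
        n_t, n_f = time_frames // t_clip, freq_bins // f_clip
        self.t_proj = nn.Linear(freq_bins * t_clip, dim)    # one token per temporal clip
        self.f_proj = nn.Linear(time_frames * f_clip, dim)  # one token per spectral clip
        self.t_pos = nn.Parameter(torch.zeros(1, n_t, dim))
        self.f_pos = nn.Parameter(torch.zeros(1, n_f, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, 1)                       # AI / non-AI logit

    def forward(self, spec):                                # spec: (batch, freq, time)
        b, f, t = spec.shape
        t_tok = spec.unfold(2, self.t_clip, self.t_clip)    # (b, f, n_t, t_clip)
        t_tok = t_tok.permute(0, 2, 1, 3).reshape(b, -1, f * self.t_clip)
        f_tok = spec.unfold(1, self.f_clip, self.f_clip)    # (b, n_f, t, f_clip)
        f_tok = f_tok.reshape(b, -1, t * self.f_clip)
        tokens = torch.cat([self.t_proj(t_tok) + self.t_pos,
                            self.f_proj(f_tok) + self.f_pos], dim=1)
        return self.head(self.encoder(tokens).mean(dim=1))
```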
The work by Afchar et al. (2024) explores music deepfake detection by training classifiers to distinguish between original audio and reconstructions generated through autoencoder decoders. Their approach achieves high accuracy () in detecting artifacts introduced by specific waveform‑based decoders (e.g., Encodec, DAC, GriffinMel) when applied to compressed or reconstructed real‑world music. However, they found that performance degrades when the classifiers face common audio manipulations (e.g., pitch shifting, re‑encoding) or unseen decoder architectures. These limitations highlight the need for robustness against evasion tactics, improved calibration, and ethical considerations such as interpretability and recourse; the authors argue that technical success alone is insufficient for real‑world deployment.
2.3 Composer identification and auto‑tagging
Composer identification and music auto‑tagging are foundational tasks in MIR that exhibit methodological overlap with AI music detection. Composer identification aims to determine the creator of a musical piece based on its stylistic, harmonic, and structural features, often leveraging probabilistic models and neural networks (Kaliakatsos‑Papakostas et al., 2011; Manaris et al., 2005; Wołkowicz et al., 2008). Detecting a specific composer entails identifying the characteristic traits of their music. Detecting an AI music composer may be more difficult as the range of its stylistic output increases. Music auto‑tagging assigns multiple descriptive labels to music data, such as mood, style, and instrumentation (Bertin‑Mahieux et al., 2010; Choi et al., 2017; Pons and Serra, 2019). Tags can also include the source of the music, such as the composer or artist. The complexity of this problem grows as the number and diversity of tags grow, since any solution must be sensitive to more and more concepts. Both of these tasks involve transforming audio data into feature representations that capture relevant information (Müller, 2015). These include handcrafted features (spectral, rhythmic, etc.) or learned ones, such as embeddings computed by the CLAP model (Wu et al., 2023).
2.4 Detecting AI‑generated content that is not music
The detection of AI‑generated content has been a growing area of research, starting with systems detecting synthetic images and text. These include detecting images where a person’s likeness is replaced with someone else’s (deepfakes) (Nguyen et al., 2022), AI‑generated text (Liang et al., 2023; Walters, 2023; Weber‑Wulff et al., 2023), an AI‑synthesized speaking voice (Ren et al., 2024; Sun et al., 2023; Wang et al., 2020), and an AI‑synthesized singing voice (Zang et al., 2024). Approaches for AI‑synthesized voice detection include analyzing acoustic and prosodic features with deep learning models (Ren et al., 2024), monitoring neuron activation patterns in speaker recognition systems (Wang et al., 2020), and detecting artifacts introduced by neural vocoders (Sun et al., 2023). AI‑synthesized voice detection shares similarities with AI music detection, as both involve distinguishing human‑made from AI‑generated audio. However, music differs from speech and singing in being polyphonic and multi‑timbral, and so requires different approaches and features.
3 Method
To develop and study AI music detection systems, we first compile a dataset of AI‑generated music from Suno and Udio—two popular platforms that are actively being developed and produce high‑quality output—and non‑AI music from the Million Song Dataset (MSD) (Bertin‑Mahieux et al., 2011). We analyze our collected audio data by extracting Essentia descriptors (Bogdanov et al., 2013) and CLAP embeddings (Wu et al., 2023). We assess the statistics of these features and visualize the CLAP embeddings of the collection using UMAP (McInnes et al., 2018). We then create an AI music detection system operating on CLAP embeddings using a hierarchical classification strategy based on the local classifier per node (LCN) approach (Miranda et al., 2023). We measure performance using precision, recall, F1 score, and confusion matrices. We compare the performance of our system to the system SpecTTTra proposed by Rahman et al. (2025), as well as the commercial IRCAM Amplify AI Music Detector.7 This system is claimed to have high accuracy in detecting AI music,8 achieving an accuracy of for AI music in general. For ‘natural music’ (non‑AI in our context), the accuracy is reported to be . It also provides a ‘confidence’ score for its prediction, but there is no description of how that is computed. On the other hand, the open‑source model SpecTTTra has a reported overall F1 score of with its largest model (SpecTTTra‑, with a context of s). We also look at the performance of these systems on an out‑of‑sample test set of music generated by a different AI music platform: Boomy. Finally, we investigate how sensitive these AI detection systems are to audio transformations (low‑pass filtering, high‑pass filtering, and downsampling). It should be emphasized that we are not proposing a way to perform AI music detection but instead seek to investigate its meaningfulness as an MIR task and the important considerations one must make about the problem definition and experimental procedures.
4 Data Collection
Since Suno and Udio do not provide public APIs for data access, we used web‑scraping techniques to collect audio recordings from them between May and October 2024. We acknowledge potential legal ambiguities under the terms of service of these platforms, which prohibit systematic scraping; however, our non‑commercial academic research is allowed under exceptions in the Copyright Directive (EU) 2019/790 (European Union, 2019). (See the supplementary material for more information.) For obtaining audio recordings from Suno, we used the ‘New Songs’ playlist,9 which is regularly updated with 50 new generated songs. We queried this playlist every two hours using an automated process. The extracted HTTP request was structured to target the unique playlist identifier and retrieve the associated metadata. For obtaining audio recordings from Udio, we used the search functionality of their website to obtain songs based on their popularity, specifically by tracking those with the highest number of plays 24 hours before the search. Metadata, including prompts and lyrics, were collected alongside audio. We collected recordings from each platform having durations over s.
We sourced non‑AI music from the MSD SQLite database found on GitHub,10 extracting YouTube URLs from each song’s Last.fm page11 and downloading the audio using the command‑line application yt‑dlp.12 We repeated this process until we obtained unique audio files. It is important to note that MSD contains a large volume of mainstream commercial Western music and is likely included in the training data of Suno and Udio (US District Court for the District of Massachusetts, 2024a; US District Court for the District of Massachusetts, 2024b). This overlap could affect evaluations of originality or model bias.
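As an illustration of the download step, the snippet below wraps the yt‑dlp command line in Python; the output directory and file naming are arbitrary choices, and the URL is a placeholder rather than one of the links we actually used.

```python
import subprocess
from pathlib import Path

def download_audio(youtube_url: str, out_dir: str = "msd_audio") -> None:
    """Download the audio track of a YouTube video as MP3 using yt-dlp."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(
        [
            "yt-dlp",
            "--extract-audio",            # keep only the audio stream
            "--audio-format", "mp3",      # convert via ffmpeg
            "--output", f"{out_dir}/%(id)s.%(ext)s",
            youtube_url,
        ],
        check=True,
    )

# URLs come from the Last.fm page of each MSD track (placeholder shown here).
# download_audio("https://www.youtube.com/watch?v=<video_id>")
```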
We extract Essentia music descriptors (e.g., spectral centroid, BPM, chords) from each recording, summarizing them via statistical distributions across frames. For more details, refer to the Essentia documentation13 and code.14 We also calculated CLAP embeddings for each audio recording. The CLAP model we use is pretrained on music.15 Briefly, the model extracts ‑bin log‑Mel spectrograms from ‑second consecutive segments of mono audio sampled at kHz, using a ‑sample window and ‑sample hop size. A transformer processes these into a ‑dimensional latent representation, which is then decoded into a ‑dimensional vector aligned with text embeddings. This ‑dimensional vector serves as the final audio representation. In summary, we have Essentia descriptors and CLAP embeddings for audio recordings: randomly selected from MSD, scraped from Udio, and scraped from Suno.
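The following sketch shows how such features can be computed with the Essentia and LAION‑CLAP Python packages. The checkpoint file name comes from note 15; the descriptor statistics and file paths are assumptions, not a verbatim excerpt of our extraction code.

```python
import numpy as np
import laion_clap
from essentia.standard import MusicExtractor

# Essentia music descriptors, summarized statistically across frames.
extractor = MusicExtractor(lowlevelStats=["mean", "stdev"],
                           rhythmStats=["mean", "stdev"],
                           tonalStats=["mean", "stdev"])
features, _ = extractor("song.mp3")
print(features["lowlevel.spectral_centroid.mean"], features["rhythm.bpm"])

# CLAP audio embedding from the music-pretrained checkpoint (note 15).
clap = laion_clap.CLAP_Module(enable_fusion=False, amodel="HTSAT-base")
clap.load_ckpt("music_audioset_epoch_15_esc_90.14.pt")
embeddings = clap.get_audio_embedding_from_filelist(x=["song.mp3"],
                                                    use_tensor=False)
print(np.asarray(embeddings).shape)  # one embedding vector per input file
```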
5 Preliminary Analysis of The Dataset
In this section, we look at various properties of our dataset, using both Essentia descriptors and CLAP embeddings.
5.1 Feature analysis
We compare and analyze the following features from the Essentia descriptors: Average Loudness, Spectral Centroid (mean), BPM, Duration (s), Pitch Salience (mean), Dynamic Complexity, Tuning Diatonic Strength, and Chord Change Rate. Table 1 summarizes these features’ averages and standard deviations, while Figure 1 visualizes their distributions. As shown, the descriptors from Udio are more similar to those from MSD than those from Suno. Average Loudness is consistent across all three sources. Suno has a lower average Spectral Centroid, suggesting reduced high‑frequency content, while Udio and MSD exhibit higher values. BPM distributions for all three sources display a bimodal pattern (peaks at BPMs). Duration distributions show Suno and Udio being tightly constrained, likely reflecting generation limits imposed by these platforms, while MSD has a smoother and broader spread of durations, indicative of more natural variation. Pitch Salience is lower in Suno, consistent with its reduced Spectral Centroid, indicating weaker harmonic content. Other features are broadly similar across sources, though Udio aligns more closely with MSD in some cases.
Table 1
Mean and standard deviation values of selected Essentia descriptors across sources: Suno, Udio, and MSD.
| Feature | Suno | Udio | MSD |
|---|---|---|---|
| Average Loudness | | | |
| Spectral Centroid | | | |
| BPM | | | |
| Duration (s) | | | |
| Pitch Salience | | | |
| Dynamic Complexity | | | |
| Tuning Diatonic Strength | | | |
| Chord Change Rate | | | |

Figure 1
Distributions of selected Essentia descriptors across sources. The means and standard deviations of these descriptors are provided in Table 1.
We perform importance analysis on all Essentia descriptors we extract in differentiating between sources (Suno, Udio, or MSD). We split the datasets into training and testing sets with an ratio and train a Random Forest classifier with estimators using scikit‑learn (Pedregosa et al., 2011). The classifier achieves strong results across all classes. For Suno, we obtain a precision of , recall of , and F1 score of . Separately, Udio yields a precision of , recall of , and F1 score of , while MSD demonstrates a precision of , recall of , and F1 score of .
We compute the mean and standard deviation values of the impurity decreases accumulated across all nodes in each decision tree where a descriptor is used to split the data. Impurity measures how mixed the data is at a node, and the impurity decrease represents how much a descriptor reduces that uncertainty during a split. The accumulation reflects the descriptor’s overall contribution to improving the model’s decision‑making. Descriptors that frequently reduce impurity across trees are considered more important for distinguishing between classes.
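A minimal sketch of this analysis with scikit‑learn is shown below. Here `X` is the matrix of Essentia descriptors, `y` holds the source labels, and `feature_names` lists the descriptor names; the split ratio and number of estimators are placeholders, since the exact values are not reproduced here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)
print(classification_report(y_test, forest.predict(X_test), digits=3))

# Mean decrease in impurity per descriptor, with its spread across trees.
importances = forest.feature_importances_
importances_std = np.std(
    [tree.feature_importances_ for tree in forest.estimators_], axis=0)
for idx in np.argsort(importances)[::-1][:10]:
    print(f"{feature_names[idx]}: {importances[idx]:.4f} ± {importances_std[idx]:.4f}")
```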
The most indicative descriptor we find is the standard deviation of the skewness of the mel bands. MSD has a skewness of , Suno of , and Udio of . The next descriptors we find relevant are the Bark bands kurtosis mean and standard deviation. These provide insights about the energy distribution and sharpness of spectral peaks across the Bark bands. Suno’s higher mean kurtosis () suggests more extreme spectral peaks, possibly from sharper transient sounds or highly dynamic frequency content. Meanwhile, the kurtosis standard deviation captures how this peak sharpness varies over time within each track. Again, Suno shows much greater variability () than MSD () and Udio ().
We further analyze properties of the audio file formats. We find that the bit rate of MSD varies significantly, with a mean of kbps, whereas all audio from Suno and Udio have fixed bit rates of kbps and kbps, respectively. For the MSD dataset, the sample rate for the audio files is kHz for 60% of them and kHz for the rest. Almost all audio files in Suno and Udio have a sample rate of kHz, except for three files in Suno, which have a sample rate of kHz.
5.2 Visualization of the dataset
We now project CLAP embeddings of all audio recordings into a two‑dimensional space using UMAP (McInnes et al., 2018). The parameters we use are a minimum distance of , which controls how tightly the UMAP projection packs points together; neighbors, which balances local versus global structure; and Manhattan as the distance metric. We set these values through trial and error, aiming for the visualization that most clearly shows the different clusters. Figure 2 shows that the Udio and MSD clusters seem to overlap, with the Suno clusters lying further from the central MSD cluster. We provide interactive two‑ and three‑dimensional UMAP projections for further exploration.16 We can see that the Udio cluster appears closer to the MSD cluster than does the Suno cluster.
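For reference, the projection can be reproduced along the following lines, where `X_clap` is the matrix of CLAP embeddings and `labels` is a NumPy array of source names; the `n_neighbors` and `min_dist` values below are placeholders since the exact settings are not reproduced here, while the Manhattan metric follows the text.

```python
import umap
import matplotlib.pyplot as plt

reducer = umap.UMAP(
    n_components=2,
    n_neighbors=15,      # placeholder value
    min_dist=0.1,        # placeholder value
    metric="manhattan",  # as stated above
    random_state=42,
)
projection = reducer.fit_transform(X_clap)

for source, color in [("suno", "tab:blue"), ("udio", "tab:red"), ("msd", "tab:green")]:
    mask = labels == source
    plt.scatter(projection[mask, 0], projection[mask, 1], s=2, c=color, label=source)
plt.legend()
plt.show()
```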

Figure 2
Two‑dimensional UMAP projection of Contrastive Language–Audio Pretraining audio embeddings from our dataset, with color denoting source: Suno (blue), Udio (red), and MSD (green).
6 AI Music Detection
We next explore the detection of AI music using CLAP embeddings of recordings from our dataset. We take a hierarchical approach to classification using the LCN approach. This technique involves constructing a unique classifier for each node within a class tree, effectively handling hierarchical relationships between classes, such as the distinction between parents (AI and non‑AI music) and children (Suno, Udio, and MSD). This hierarchical structure enables us to handle complex relationships between classes that flat classification approaches may not capture (Silla and Freitas, 2011).
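The sketch below illustrates the LCN idea for our two‑level hierarchy: one binary classifier at the root (AI vs. non‑AI) and one classifier at the AI node (Suno vs. Udio). It is a minimal illustration assuming string labels 'suno', 'udio', and 'msd' and NumPy feature arrays, not our exact implementation.

```python
import numpy as np
from sklearn.base import clone
from sklearn.svm import SVC

class TwoLevelLCN:
    """Local classifier per node for the hierarchy root -> {AI, non-AI},
    AI -> {suno, udio}; MSD is the only child of the non-AI node."""

    def __init__(self, base_estimator=None):
        base = base_estimator or SVC(probability=True)
        self.parent_clf = clone(base)   # AI vs. non-AI
        self.child_clf = clone(base)    # suno vs. udio, within AI

    def fit(self, X, y):
        y = np.asarray(y)
        is_ai = np.isin(y, ["suno", "udio"])
        self.parent_clf.fit(X, is_ai)
        self.child_clf.fit(X[is_ai], y[is_ai])
        return self

    def predict(self, X):
        is_ai_pred = self.parent_clf.predict(X).astype(bool)
        out = np.array(["msd"] * len(X), dtype=object)
        if is_ai_pred.any():
            out[is_ai_pred] = self.child_clf.predict(X[is_ai_pred])
        return out
```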
We partition the dataset into training, validation, and test sets with a ratio of , ensuring that similar or duplicate recordings do not appear across different splits. To do this, we use metadata associated with each music audio track: for Suno and Udio, we group by ‘prompt,’ and, for MSD, we group by ‘artist.’ This prevents different recordings containing the same audio or part of it (e.g., they have the same first but the rest is different) from being split across sets. After partitioning, we normalized the CLAP embeddings by subtracting the empirical mean and dividing by the empirical standard deviation of the features in the training set.
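A sketch of this leakage‑aware partitioning using scikit‑learn’s GroupShuffleSplit is shown below, with `X`, `y`, and `groups` as NumPy arrays (`groups` holding the prompt or artist of each recording). The split proportions are placeholders, since the exact ratio is not reproduced here.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit
from sklearn.preprocessing import StandardScaler

def grouped_three_way_split(X, y, groups, test_size=0.15, val_size=0.15, seed=42):
    """Split into train/val/test so that no group spans two partitions."""
    outer = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    trainval_idx, test_idx = next(outer.split(X, y, groups))
    inner = GroupShuffleSplit(n_splits=1,
                              test_size=val_size / (1.0 - test_size),
                              random_state=seed)
    tr, val = next(inner.split(X[trainval_idx], y[trainval_idx],
                               np.asarray(groups)[trainval_idx]))
    return trainval_idx[tr], trainval_idx[val], test_idx

train_idx, val_idx, test_idx = grouped_three_way_split(X, y, groups)
scaler = StandardScaler().fit(X[train_idx])   # statistics from the training set only
X_train, X_val, X_test = (scaler.transform(X[idx])
                          for idx in (train_idx, val_idx, test_idx))
```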
We use SVMs, random forests, and K‑nearest neighbors (KNN), evaluating each with precision, recall, and F1 score at both the parent (AI vs. non‑AI) and child (Suno, Udio, MSD) levels. Most classifier parameters follow scikit‑learn defaults. The SVM, implemented via scikit‑learn’s SVC class, uses an RBF kernel with probability estimation enabled, gamma set to ‘scale’, and a fixed random seed for reproducibility. The random forest builds 100 trees using Gini impurity, with no depth limit, allowing full tree growth. Our 5‑NN classifier uses uniform weights, Euclidean distance, and scikit‑learn’s ‘auto’ algorithm to choose the neighbor search method.
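These configurations correspond roughly to the scikit‑learn objects below; the random seed is an arbitrary placeholder, and the labels passed in can be either the parent‑ or child‑level classes.

```python
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

classifiers = {
    "SVM": SVC(kernel="rbf", gamma="scale", probability=True, random_state=42),
    "RF": RandomForestClassifier(n_estimators=100, criterion="gini",
                                 max_depth=None, random_state=42),
    "5-NN": KNeighborsClassifier(n_neighbors=5, weights="uniform", algorithm="auto"),
}

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(name)
    print(classification_report(y_test, clf.predict(X_test), digits=3))
```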
6.1 Results and discussion
We evaluate our hierarchical classification approach at the parent and child levels using a sample set of 150 samples per source (450 in total). We do this because we have limited credits for the commercial baseline IRCAM Amplify. Results are summarized in Tables 2, 3, and 4.
Table 2
Parent‑level classification results (AI vs. non‑AI) on the sample set.
| Classifier | Class | Precision | Recall | F1 Score | Support |
|---|---|---|---|---|---|
| SVM | Non‑AI | 0.958 | 0.913 | 0.935 | 150 |
| | AI | 0.958 | 0.980 | 0.969 | 300 |
| RF | Non‑AI | 0.955 | 0.840 | 0.894 | 150 |
| | AI | 0.925 | 0.980 | 0.951 | 300 |
| 5‑NN | Non‑AI | 0.961 | 0.820 | 0.885 | 150 |
| | AI | 0.916 | 0.983 | 0.949 | 300 |
| IRCAM Amplify | Non‑AI | 1.000 | 0.953 | 0.976 | 150 |
| | AI | 0.977 | 1.000 | 0.988 | 300 |
| SpecTTTra | Non‑AI | 0.528 | 0.893 | 0.663 | 150 |
| | AI | 0.918 | 0.600 | 0.726 | 300 |
Table 3
Normalized confusion matrices for various classifiers on the sample set, classified at the parent level.
| True \ Predicted | SVM Non‑AI | SVM AI | RF Non‑AI | RF AI | 5‑NN Non‑AI | 5‑NN AI | IRCAM Amplify Non‑AI | IRCAM Amplify AI | SpecTTTra Non‑AI | SpecTTTra AI |
|---|---|---|---|---|---|---|---|---|---|---|
| MSD | 0.913 | 0.087 | 0.840 | 0.160 | 0.820 | 0.180 | 0.953 | 0.047 | 0.893 | 0.107 |
| Suno | 0.007 | 0.993 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.047 | 0.953 |
| Udio | 0.033 | 0.967 | 0.040 | 0.960 | 0.033 | 0.967 | 0.000 | 1.000 | 0.753 | 0.247 |
Table 4
Child‑level classification results (MSD, Suno, Udio) on the sample set.
| Classifier | Class | Precision | Recall | F1 Score | Support |
|---|---|---|---|---|---|
| SVM | MSD | 0.951 | 0.907 | 0.928 | 150 |
| | Suno | 0.970 | 0.867 | 0.915 | 150 |
| | Udio | 0.815 | 0.940 | 0.873 | 150 |
| RF | MSD | 0.947 | 0.833 | 0.887 | 150 |
| | Suno | 0.943 | 0.767 | 0.846 | 150 |
| | Udio | 0.699 | 0.913 | 0.792 | 150 |
| 5‑NN | MSD | 0.953 | 0.813 | 0.878 | 150 |
| | Suno | 0.714 | 0.980 | 0.826 | 150 |
| | Udio | 0.776 | 0.600 | 0.677 | 150 |
Table 2 shows the parent‑level classification results, where IRCAM Amplify achieves the highest F1 scores (0.976 for non‑AI, 0.988 for AI). For non‑AI music, the perfect precision of 1.000 indicates correct identification of all human‑made samples in our sample set. However, its recall of 0.953 means the classifier correctly identifies 95.3% of human‑made music samples, misclassifying the remaining 4.7% as AI‑generated—a significant portion that could amount to millions of errors when scaled to commercial music catalogs of many millions of items. For AI music, the classifier achieves a recall of 1.000, detecting all AI‑generated samples in our sample set, but its precision of 0.977 implies that 2.3% of its AI predictions are actually non‑AI samples. While this combination of high precision for non‑AI and high recall for AI highlights IRCAM Amplify’s strength in distinguishing between AI and non‑AI music, the false‑positive rate for human‑made content underscores a critical trade‑off: prioritizing the detection of AI content may come at the cost of mislabeling a non‑negligible number of human‑made tracks. This raises ethical concerns, as platforms purging AI content could inadvertently censor human art, harming artists and cultural heritage.
SpecTTTra underperforms, with F1 scores of 0.663 (non‑AI) and 0.726 (AI). Its precision for non‑AI is only 0.528, meaning it frequently misclassifies AI music as non‑AI, while its recall for AI is low at 0.600, indicating it misses many AI samples. For the other classifiers, the trade‑off between precision and recall varies. The SVM classifier, for example, achieves a precision of 0.958 and recall of 0.980 for AI music, leading to a balanced F1 score of 0.969. Random forest and 5‑NN, while maintaining high precision, have lower recalls for non‑AI music, indicating a tendency to over‑identify AI music at the expense of correctly classifying all non‑AI music.
Table 3 shows the normalized confusion matrices of our classifiers. We find that Udio is more frequently misclassified as non‑AI music than Suno, especially by SpecTTTra, which mislabels 75.3% of Udio samples as non‑AI, compared to only 4.7% of Suno samples. This suggests SpecTTTra may be overfit to Suno. Even other classifiers, such as 5‑NN, show a similar pattern: 3.3% of Udio samples are classified as non‑AI, compared to none of the Suno samples. Misclassifications also tend to occur in the direction of falsely identifying non‑AI music (MSD) as AI‑generated, rather than the opposite. In the case of our hierarchical classifiers, this could be due to data imbalance at the parent level, as there are twice as many AI music samples (from Suno and Udio) in the training set as non‑AI samples.
Table 4 shows the child‑level classification results. Distinguishing between specific music sources (MSD, Suno, Udio) proves more challenging. SVM performs well on MSD (0.928 F1) and Suno (0.915), but its score drops to 0.873 for Udio, a trend seen across classifiers. This suggests Udio’s features are less distinctive, possibly resembling non‑AI music more closely. In contrast, Suno is consistently easier to identify. This performance discrepancy points to the difficulty of identifying specific AI platforms.
We evaluate our classifiers on music audio recordings generated by users of the out‑of‑sample AI music platform Boomy,17 which we downloaded from its website. These data present a unique challenge as they contain music generated by a different AI platform than those in our training set and, as far as we understand, they are not part of the IRCAM Amplify detector’s training set. We find that only of the Boomy recordings are identified as AI music by the IRCAM Amplify detector.18 Our classifiers do not perform much differently, with the SVM, random forest, and 5‑NN classifiers identifying only , , and Boomy tracks as AI‑generated, respectively. SpecTTTra outperforms these by identifying Boomy tracks as AI, but this still represents only a % detection rate. This poor performance might stem from Boomy’s generative approach, which differs substantially from that of Suno and Udio. While not known for sure, the sound quality of Boomy suggests the platform generates music symbolically, which is then sequenced using samples and automatically mixed. This shows how AI music detection systems can seemingly perform very well on music generated by particular platforms but fail on others. SpecTTTra’s slight improvement over IRCAM Amplify on Boomy hints that its misclassification patterns (e.g., over‑reliance on Suno’s features) might occasionally align with Boomy’s audio characteristics, but its overall low accuracy underscores the broader challenge of cross‑platform generalization.
6.2 Cross‑platform AI music detection
We now investigate the generalizability of AI music detection using cross‑domain training and testing—that is, we train a detector to differentiate between the music audio of one AI platform and MSD, then test it on the other AI platform. We use the same CLAP embeddings and classifiers from our previous experiments. This approach allows us to assess whether the detection methods are learning source–specific artifacts or general characteristics of AI music.
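A compact sketch of this protocol is given below, reusing the normalized embeddings and group‑aware splits from Section 6; the helper assumes NumPy arrays of string labels and treats MSD as the non‑AI class.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import f1_score

def cross_domain_f1(train_platform, test_platform):
    """Train an AI/non-AI SVM on one platform plus MSD, test on another."""
    tr = np.isin(y_train, [train_platform, "msd"])
    te = np.isin(y_test, [test_platform, "msd"])
    clf = SVC(kernel="rbf", probability=True)
    clf.fit(X_train[tr], y_train[tr] != "msd")          # True means AI music
    preds = clf.predict(X_test[te])
    return f1_score(y_test[te] != "msd", preds, average="macro")

for src, tgt in [("suno", "suno"), ("udio", "suno"), ("udio", "udio"), ("suno", "udio")]:
    print(f"{src} -> {tgt}: macro F1 = {cross_domain_f1(src, tgt):.3f}")
```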
Table 5 shows the cross‑domain classification results and reveals several insights about the generalizability of our detection methods. Classifiers achieve F1 scores between 0.934 and 0.995 when trained and tested on the same source (e.g., SVM: 0.995 for Suno). Interestingly, models trained on Udio seem to generalize well to Suno samples (F1 scores between 0.940 and 0.972), but the reverse is not true—models trained on Suno perform poorly when tested on Udio (F1 scores between 0.629 and 0.778). This asymmetric generalization suggests that Udio’s audio characteristics might encompass a broader range of features that are also present in Suno’s outputs. Among the classifiers, 5‑NN shows the most consistent performance across different training and testing combinations, albeit with slightly lower peak performance, suggesting it may be learning more generalizable features at the cost of source‑specific optimization.
Table 5
Cross‑domain classification results showing performance when training on one AI source plus MSD and testing on another AI source. Results are grouped by classifier type and ordered by F1 score within each group.
| Classifier | Train → Test | Precision | Recall | F1 Score |
|---|---|---|---|---|
| SVM | Suno → Suno | 0.995 | 0.995 | 0.995 |
| | Udio → Suno | 0.972 | 0.972 | 0.972 |
| | Udio → Udio | 0.971 | 0.971 | 0.971 |
| | Suno → Udio | 0.795 | 0.673 | 0.629 |
| RF | Suno → Suno | 0.988 | 0.988 | 0.988 |
| | Udio → Suno | 0.956 | 0.955 | 0.955 |
| | Udio → Udio | 0.949 | 0.948 | 0.948 |
| | Suno → Udio | 0.815 | 0.735 | 0.713 |
| 5‑NN | Suno → Suno | 0.985 | 0.985 | 0.985 |
| | Udio → Suno | 0.943 | 0.940 | 0.940 |
| | Udio → Udio | 0.936 | 0.934 | 0.934 |
| | Suno → Udio | 0.834 | 0.788 | 0.778 |
7 Effects of Audio Filtering on Detection
Given the success we observe in our experiments, we are motivated to seek the features of the data that are indicative of AI music. Our hypothesis is that music generated by Suno and Udio is not somehow deficient in quality when compared to human‑made music, but instead the success of their identification relates to some acoustic signatures that are specific to their audio‑generation pipelines. If our classifiers are heavily reliant on features of a particular frequency band, then their behavior will change when we attenuate that band. We thus apply low‑pass and high‑pass filtering to our test set using a fifth‑order digital Butterworth filter with cut‑off frequencies ranging from Hz to kHz. We compute the CLAP embeddings of the modified test set, classify them with our previously trained classifiers, and finally compute the F1 score as a function of cut‑off frequency. We also observe how the baseline models respond to these transformations.
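The filtering step can be sketched as follows with SciPy; the cut‑off values shown are examples, and the zero‑phase implementation here is one reasonable choice rather than a verbatim excerpt of our experimental code.

```python
import soundfile as sf
from scipy.signal import butter, sosfiltfilt

def butter_filter(audio, sr, cutoff_hz, btype="lowpass", order=5):
    """Apply an order-5 digital Butterworth low- or high-pass filter."""
    sos = butter(order, cutoff_hz, btype=btype, fs=sr, output="sos")
    return sosfiltfilt(sos, audio, axis=0)

audio, sr = sf.read("test_song.wav")   # shape: (samples,) or (samples, channels)
lowpassed = butter_filter(audio, sr, cutoff_hz=4000, btype="lowpass")
highpassed = butter_filter(audio, sr, cutoff_hz=500, btype="highpass")
sf.write("test_song_lp4k.wav", lowpassed, sr)
```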
The CLAP embeddings might be invariant to such transformations, which would make this filtering experiment uninformative, so we first test that possibility. We add a tone of kHz at dB to randomly selected ‑second audio clips from Suno and Udio. With these, we retrain a simple SVM as well as the hierarchical classifier with SVMs at its nodes. We see that the F1 score on the filtered test set remains very low () for cut‑off frequencies below kHz, begins to rise sharply between kHz and kHz, and reaches nearly 1 for cut‑off frequencies above kHz. This is precisely the behavior we expect if CLAP embeddings preserve frequency‑domain characteristics. It also shows that our intervention pipeline works as expected.
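For completeness, the tone‑injection step can be sketched as below; the frequency, level, and clip duration used in the experiment are not reproduced here, so the values in the snippet are placeholders.

```python
import numpy as np

def add_tone(audio, sr, freq_hz=16000.0, level_db=-20.0):
    """Mix a pure sine tone into a mono clip at a given level in dBFS.
    freq_hz and level_db are placeholders, not the experimental values."""
    t = np.arange(len(audio)) / float(sr)
    amplitude = 10.0 ** (level_db / 20.0)
    tone = amplitude * np.sin(2.0 * np.pi * freq_hz * t)
    return np.clip(audio + tone, -1.0, 1.0)
```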
Figure 3 shows how the F1 score of each classifier is impacted by filtering of the test set. We see that, as the low‑pass filter cut‑off frequency decreases, the performance of all three classifiers decreases. This suggests they rely on features in a high‑frequency band to differentiate between AI and non‑AI samples. In the case of high‑pass filtering, we see a decrease in performance when the cut‑off frequency exceeds only a few hundred Hz. This suggests that the classifiers also use low‑frequency features.

Figure 3
Impact of low‑pass (left) and high‑pass (right) filtering on the test set. This figure shows the F1 scores for three classifiers (support vector machine, random forest, 5‑NN) as a function of cut‑off frequency.
How is the performance of each classifier impacted by filtering for each source? Figures 4 and 5 show how such filtering impacts classifier performance for Udio samples and Suno samples, respectively. Low‑pass filtering does not seem to greatly impact the labeling of these samples as AI, even when excluding nearly all acoustic information. This suggests that there are features at very low frequencies that are indicative of the AI class. On the other hand, the impact of high‑pass filtering shows that, even when removing these low‑frequency features, classifier performance is not greatly impacted until the cut‑off frequency exceeds kHz. This suggests that there are features below this cut‑off frequency that are indicative of the AI class. Hence, features indicative of the AI class appear to have identifiable characteristics in the frequency range kHz.

Figure 4
Impact of low‑pass (left) and high‑pass (right) filtering on the Udio test set. This figure shows the F1 scores for three classifiers (support vector machine, random forest, 5‑NN) as a function of cut‑off frequency.

Figure 5
Impact of low‑pass (left) and high‑pass (right) filtering on the Suno test set. This figure shows the F1 scores for three classifiers (support vector machine, random forest, 5‑NN) as a function of cut‑off frequency.
Figure 6 shows the false‑positive rate of each classifier for the MSD subset of the test set. This shows that an MSD sample can be confused as being AI‑generated by our classifiers if it is low‑pass–filtered with a low cut‑off frequency or high‑pass–filtered with a cut‑off frequency around kHz.

Figure 6
Impact of low‑pass (left) and high‑pass (right) filtering on the Million Song Dataset test set. This figure shows the false‑positive rate for three classifiers (support vector machine, random forest, 5‑NN) as a function of cut‑off frequency.
We now look at how the IRCAM Amplify detector is impacted by these filtering transformations. Because this model can only be queried a fixed number of times, we randomly select a subset of five samples per source from the test set for testing. We also test the impact of a sampling rate conversion on AI detection. Without transformation, this model shows perfect performance. The results from these 15 samples can be found on our dedicated website.19 We observe the following results from the transformations: high‑pass filtering ( kHz) makes IRCAM falsely label all audio (even MSD) as AI, while low‑pass filtering ( kHz) causes it to miss all AI music. Increasing the low‑pass filter cut‑off frequency to kHz causes it to misclassify all Suno samples and two of the five Udio samples. All samples from MSD are still correctly classified with these low‑pass transformations. Confidence levels for these classifications vary from around up to .
We also test the sensitivity of the baseline models to sampling rate conversion. When we reduce the sample rate of all 15 samples to kHz, IRCAM Amplify misclassifies two of the MSD samples, three of the five Udio samples, and all of the Suno samples. Confidence levels range between and . Resampling to a rate of kHz, however, has no impact on the perfect performance of IRCAM Amplify. This suggests a curious sensitivity of IRCAM Amplify on sample rate for identifying AI music from these sources.
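The sample‑rate conversion itself can be done as in the sketch below; the target rate shown is a placeholder for the rates discussed above.

```python
import librosa
import soundfile as sf

def resample_file(in_path, out_path, target_sr):
    """Resample an audio file to target_sr and write it back to disk."""
    audio, sr = librosa.load(in_path, sr=None, mono=False)   # keep the native rate
    resampled = librosa.resample(audio, orig_sr=sr, target_sr=target_sr)
    sf.write(out_path, resampled.T if resampled.ndim > 1 else resampled, target_sr)

resample_file("song.wav", "song_resampled.wav", target_sr=22050)  # placeholder rate
```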
8 Discussion
In this paper, we explore the timely topic of detecting whether or not a music recording is generated by AI. We focus on two AI music platforms, Suno and Udio, and build a novel dataset consisting of Essentia descriptors and CLAP embeddings of recordings that we downloaded from each platform. To represent non‑AI music, we randomly selected recordings from the MSD. Our statistical analyses of features extracted from this dataset show how the music recordings generated by Udio appear closer to the recordings from MSD than do those generated by Suno. We then explored the detection of AI music using our dataset and found the task easy to address using basic approaches to machine learning. We found our detection systems’ precision exceeded , comparable to that of the IRCAM Amplify detector and exceeding that of SpecTTTra. Testing on AI‑generated tracks from the out‑of‑sample source Boomy, however, yielded significantly worse results for all detection systems. Finally, we found that specific frequency bands ( Hz or kHz) impact detection, and that simply resampling audio to kHz can change the outputs of all detectors.
Given that the quality of the music being generated by Udio and Suno is becoming competitive with some commercial music (Goldmedia, 2024), we found the observed success of AI music detection to be surprising at first. We do not know precisely what information these detection systems are exploiting to perform so well, but our experiments suggest they may be picking up on subtle audio artifacts introduced in the generation process. Indeed, auditioning the outputs of Suno shows that they generally suffer from acoustic phasing that is difficult to remedy.20 Audio recordings output by Udio seem to have much higher fidelity in contrast.
The results of our experiments also might be influenced by structure in our dataset. Both Suno and Udio generate output at kHz with different bit rates: Suno at kbps and Udio at kbps. Of the audio data we downloaded for MSD from YouTube, has a kHz sampling rate, with the rest sampled at kHz and bit rates averaging kbps. This structure allows for very simple means for detection; for example, a simple rule based on the sampling rate being kHz could produce high precision (). The CLAP model resamples audio to kHz to standardize sampling rates, but bit rate differences may still introduce compression artifacts in the embeddings. These artifacts could create spurious correlations, allowing detectors to distinguish AI music without relying on musical features. Such spurious correlations align with findings in shortcut learning (Geirhos et al., 2020) and raise concerns about model robustness, as causal inference principles (Peters et al., 2017) highlight that such shortcuts may fail under distribution shifts (e.g., adversarial bit‑rate manipulation). Some AI music may also be easier to detect due to upsampling artifacts. Based on earlier versions of Suno,21 we suspect that newer models may generate audio at a sampling rate of kHz, which is then upscaled to kHz. This upscaling process could introduce identifiable artifacts, such as low fidelity in high frequencies, which may help classifiers identify AI‑generated music. Although we could standardize our data to match the sampling rate and bit rate, differences in platform processing pipelines could still leave these artifacts intact. Thus, we chose not to standardize our data. Since all Suno and Udio music audio comes from distinct processing pipelines, and all MSD music audio comes from a wide variety of unique mastering chains, the high performance of any detector trained and tested on our dataset is not at all surprising. A valid conclusion to draw from our experiments is thus not that our detectors reliably detect AI music, but rather that they detect whether a given audio file has characteristics consistent with the production pipeline of Suno, Udio, or something else. This aligns with the findings of Afchar et al. (2024).
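To illustrate how trivial such a shortcut would be, the sketch below flags a file from its container sample rate alone; the threshold is a placeholder, and the point is that this is a spurious cue rather than a recommended detector.

```python
import soundfile as sf

def naive_pipeline_flag(path, ai_sample_rate=32000):
    """Flag a file as 'AI-like' purely from its container sample rate.
    The rate used here is a placeholder, not the platforms' actual rate."""
    return sf.info(path).samplerate == ai_sample_rate

print(naive_pipeline_flag("song.wav"))
```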
The method employed in this study presents several limitations that should be acknowledged. One notable issue is the choice of feature representation. Although CLAP is becoming standard for audio, its design affects classifier performance. While our findings suggest the presence of subtle signal‑level artifacts in AI‑generated audio, it is important to note that these results may reflect the feature‑representation capabilities of the CLAP model rather than inherent characteristics of the AI‑generated audio. The observed sensitivity to specific frequency bands could stem from how CLAP encodes audio features, which may not directly correspond to artifacts introduced by the generation process. This distinction underscores a common challenge in explainable AI: disentangling model behavior from underlying data patterns. Future research should explore other embedding models to test the consistency of these findings.
There is also a generalization problem, known as the Collingridge dilemma (Collingridge, 1980), where early‑stage methods may work well in controlled environments but struggle to adapt when applied to new data or scenarios. This uncertainty raises concerns about how well the method will adapt to AI music from other sources, especially as models evolve and become more sophisticated. Our cross‑domain evaluation results provide a concrete example of this dilemma. While classifiers achieve near‑perfect performance when trained and tested on the same source (F1 scores up to 0.995), performance decreases significantly when tested across different AI music systems, particularly from Suno to Udio (F1 scores dropping to 0.629). This asymmetric generalization, where models trained on Udio transfer well to Suno but not vice versa, demonstrates how models that perform well in controlled settings may struggle with emerging AI music systems.
This generalization problem also reflects the constantly shifting environment in which AI music detection is to be deployed. While one solution may work for particular systems at specific times, these systems are in constant development, and so AI music detectors must similarly adapt to be able to identify the outputs of updated systems. This is somewhat akin to an ‘arms race’ where systems are built to evade detection, and detection systems are adapted to particulars of the new systems.
Finally, describing music as ‘AI‑generated’ or not is ambiguous, as music can blend human and AI tools, making such binary classification reductive. Contemporary music production heavily relies on digital tools, synthesizers, and AI‑assisted processes for tasks such as mixing, mastering, and composition. Many commercial recordings incorporate both human performance and AI‑enhanced elements. This complexity extends beyond mere technical considerations to philosophical questions about creativity and authenticity in musical expression (Boden, 1998). Hence, the question of how much AI must there be in a piece of music before the ‘AI music’ label applies deserves careful attention.
9 Future Work
Future work should test larger datasets and more transformations to understand how detectors generalize to new AI platforms. Moreover, closed‑source systems like YouTube Content ID,22 Believe’s AI Radar,23 and Audible Magic24 remain untested here, but future work could evaluate their efficacy via heuristic approaches (e.g., submitting borderline samples for automated flagging). Testing other AI platforms (e.g., Boomy) will reveal how well detectors generalize to new systems. A deeper analysis of the higher‑level musical forms and structures of AI music could provide insights into potential ways of identifying it. By examining aspects such as melody, harmony, rhythm, and overall compositional style, researchers could identify whether AI systems replicate musical forms present in their training data. However, when these training data consist of human‑created music anyway, the applicability of the ‘AI music’ label is even more contestable. This aligns with broader efforts to understand the artistic limitations and potential of AI systems as creative tools. Finally, one could analyze the textual metadata of our collection to look for emerging themes, trends, and behaviors of Suno and Udio users. This is similar to the work by Sanchez (2023), who investigated text‑to‑image–generation practices, as well as the extension by McCormack et al. (2024).
10 Reproducibility
The code for this paper is available at https://github.com/lcrosvila/ai-music-detection. The repository includes scripts for dataset preparation, feature extraction, model training, and performance evaluation. We will not publicly release the audio data or the extracted features until Suno and Udio respond affirmatively to our request for their permission. Thus, the repository we release provides a list of song IDs that one can use to find the songs on both platforms.
11 Ethics and Consent
The impacts of AI‑generated content are yet to be understood, but the development of such technologies, and the vast scales at which content can be generated, make it imperative to study them (Sturm et al., 2024; Pelly, 2025). As our analysis focused on music we scraped from the commercial platforms Suno and Udio, there are several ethical aspects we must consider. The scraping of public websites, even when permissible under the terms of a platform (which, in this case, are not clear), raises questions about the fair use of such collected digital content, particularly when the data are used for research purposes. As we are a publicly funded research group focused on understanding the nature of AI music and its broader implications, we are permitted to collect such data and study them under the Copyright Directive (EU) 2019/790 (European Union, 2019). This is addressed in more detail in the supplementary material.
Acknowledgments
This paper is an outcome of a project that received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (MUSAiC, Grant Agreement No. 864189).
Contributions
L. Cros Vila performed all data collection, all the experiments, most of the writing, and investigated the legal aspects of this work. B. L. T. Sturm guided the work; provided extensive feedback on the variety of experiments, results, and conclusions; and conducted extensive editing of the manuscript. L. Casini provided remarks on the implementation of the experiments and feedback on the manuscript. D. Dalmazzo provided help with the ideation of the article and feedback on the manuscript.
Competing Interests
The authors have no competing interests to declare.
Additional File
The additional file for this article can be found using the links below:
Supplementary Material
Legal and Ethical Considerations for our Study of Music Generated by Suno and Udio. DOI: https://doi.org/10.5334/tismir.254.s1.
Notes
[1] An AI‑generated song reached the charts in Germany in 2024 https://www.musicradar.com/news/first-ever-ai-chart-germany, last accessed Apr. 23, 2025.
[2] https://musically.com/2023/06/08/figaro-will-help-streaming-services-deal-with-ai-generated-music, last accessed Apr. 23, 2025.
[3] https://newsroom-deezer.com/2025/01/deezer-deploys-cutting-edge-ai-detection-tool-for-music-streaming/, last accessed Apr. 23, 2025.
[4] https://www.ircamamplify.io, last accessed Apr. 23, 2025.
[5] https://suno.com, last accessed Apr. 23, 2025.
[6] https://www.udio.com, last accessed Apr. 23, 2025.
[7] https://www.ircamamplify.io/product/ai-generated-music-detector, last accessed Apr. 23, 2025.
[8] https://www.ircamamplify.io/blog/why-our-ai-music-detector-stays-ahead, last accessed Apr. 23, 2025.
[9] https://suno.com/playlist/cc14084a-2622-4c4b-8258-1f6b4b4f54b3, last accessed Apr. 23, 2025.
[10] https://github.com/renesemela/lastfm-dataset-2020/raw/refs/heads/master/datasets/lastfm_dataset_2020/lastfm_dataset_2020.db, last accessed Apr. 23, 2025.
[11] Note that YouTube links provided on the Last.fm website may lead to incorrect or incomplete tracks.
[12] https://github.com/yt-dlp/yt-dlp, last accessed Apr. 23, 2025.
[13] https://essentia.upf.edu/streaming_extractor_music.html, last accessed Apr. 23, 2025.
[14] https://github.com/MTG/essentia/tree/master/src/essentia/utils/extractor_music, last accessed Apr. 23, 2025.
[15] https://huggingface.co/lukewys/laion_clap/blob/main/music_audioset_epoch_15_esc_90.14.pt, last accessed Apr. 23, 2025.
[17] https://boomy.com, last accessed Apr. 23, 2025.
[18] Since the time of running these experiments and writing and revising this manuscript, the IRCAM detector has apparently been trained to detect Boomy: https://www.ircamamplify.io/blog/ai-music-detector-now-detects-boomy, last accessed May 23 2025.
[19] https://is-this-ai.github.io/visualizations/audio_analysis_results.html, last accessed Apr. 23, 2025.
[20] See this Reddit discussion from June 7, 2024, on various artifacts users identify with Suno‑generated material https://www.reddit.com/r/SunoAI/comments/1dals3d/audio_qualitynot_ready_yet/, last accessed Apr. 23, 2025.
[21] See the bark model on GitHub: https://github.com/suno-ai/bark, last accessed Apr. 23, 2025.
[22] YouTube Help. How Content ID works. https://support.google.com/youtube/answer/2797370, last accessed Apr. 23, 2025.
[23] Music Business Worldwide. Believe has developed its own AI‑made music detector with 98% accuracy. What might this mean for the future? https://www.musicbusinessworldwide.com/believe-has-developed-its-own-ai-made-music-detector-with-98-accuracy-what-might-this-mean-for-the-future1/, last accessed Apr. 23, 2025.
[24] Audible Magic. https://www.audiblemagic.com/, last accessed Apr. 23, 2025.
