Data‑Driven Analysis of Text‑Conditioning in AI‑Generated Music: A Case Study with Suno and Udio

Luca Casini; Laura Cros Vila; David Dalmazzo; Anna-Kaisa Kaila; Bob L.T. Sturm

doi:10.5334/tismir.273

1 Introduction

The application of artificial intelligence (AI) to creating text, graphics, and music has expanded from academic research into the commercial realm, with powerful generative tools now available publicly. While obviously central for large language models like ChatGPT (Bubeck et al., 2023), text has become the main interface for other modalities like images following the success of generative models like OpenAI’s DALL‑E (Ramesh et al., 2022) and Stable Diffusion (Rombach et al., 2022). Textual conditioning inputs are commonly referred to as ‘prompts,’ and the idea of ‘prompt engineering’ is emerging around shared best practices (Oppenlaender et al., 2024). AI‑generated music (AI music) is starting to appear in the cultural mainstream, e.g., charting in Europe (Klevjer, 2024; Kriisa, 2024; Simpson, 2024) and appearing in marketing campaigns (Åkestam Holst, 2024). Professional music creators are also integrating AI into their workflows. A 2024 survey (Goldmedia, 2024) shows that 35% of GEMA (Gesellschaft für musikalische Aufführungs‑ und mechanische Vervielfältigungsrechte [Society for musical performing and mechanical reproduction rights]) and SACEM (Société des auteurs, compositeurs et éditeurs de musique [The Society of Authors, Composers and Publishers of Music]) members report using AI technologies in their work, reaching 51% for those under 35 years old.

There are many AI‑powered systems for music creation from companies like Google, Meta, Microsoft, OpenAI, Stability AI, and Spotify and startups like Aiva Technologies, Boomy, Riffusion, Suno, and Udio. Suno recently announced that its users create about seven million music audio recordings every day (Robinson, 2025). Given the popularity of these platforms, how can we study their use from the perspective of the text‑to‑music interaction modality? What large‑scale patterns can we observe? What can we learn about the users of these platforms and their goals? This paper focuses on analyzing AI music from the platforms Suno and Udio, as they represent the most popular services offering text‑to‑music generation, with hundreds of thousands of users (Yang, 2025) and active communities on social media platforms like Reddit and Discord.^¹

We approach these musicological questions with a data‑driven methodology, using tools from natural language processing (NLP) and data science. This methodology allows one to manage the unique nature of this new music practice by shifting from in‑depth analysis of individual works to a broader, data‑driven exploration of music produced by a broad community, thereby uncovering large‑scale trends in AI music. We start by collecting a large dataset of $101,953$ songs generated by $60,342$ users of Suno and Udio from May to October 2024. This dataset consists of textual prompts, lyrics, and tags associated with the music generated by these platforms. We process the dataset and analyze it in order to answer the following six research questions:

What is the distribution of languages used for the textual inputs?
What characterizes prompts?
What characterizes lyrics?
What characterizes tags?
How are mentions of real artists used by the users?
What kind of additional instructions (metatags) are included in the textual input for lyrics?

Other questions are possible to ask, but the limited length of this publication restricts the number we can answer here. We discuss some of these possibilities toward the end.

Despite the growing prominence of AI music–generation platforms, there has been no systematic analysis of how these systems are actually being used in practice. This article addresses this gap and makes three contributions to the emerging field of ‘AI music studies’ (Sturm et al., 2024). First, we develop and apply a new methodology for the study of AI music from text‑controlled systems based on textual features. We do this by combining well‑established techniques and state‑of‑the‑art tools from related fields and putting them to the test for this specific domain. Second, we share our code along with a list of URLs for the songs in our dataset to facilitate reproducibility, comparative studies, and other research.^² Finally, we produce interactive visualizations for exploring and navigating our dataset.^³

The article is structured as follows: Section 2 provides background information on lyrics and prompt analysis. Section 3 describes the interaction modality of these platforms and explains how they integrate metadata in the generation process. Section 4 details the methodology of our study. Section 5 describes the dataset we collect and use for our analyses. Section 6 illustrates the results we obtain from our analyses of our dataset. Section 7 contains a discussion of the results and points to further work. Finally, Section 8 concludes the paper.

2 Related Work

We first review work done in prompt analysis, particularly in the context of text‑to‑image generation. Then, we present research on lyrics analysis, focusing on computational methods for extracting meaning, sentiment, and stylistic patterns from song lyrics.

2.1 Prompt analysis

To the best of our knowledge, there is no work that studies prompting in the context of AI music generation. The closest is perhaps the SONICS dataset (Rahman et al., 2024), which is proposed as training data for AI music detection. It contains $\sim 49,000$ songs generated with Suno and Udio using a template‑based approach to creating prompts and lyrics. It is not representative of how Suno and Udio are being used and so cannot be used to answer any of our research questions. We thus turn to similar work in the context of image generation.

A notable example is the work by Sanchez (2023) investigating text‑to‑image generation practices by looking at DiffusionDB (Wang et al., 2022), a dataset of 14 million real images and prompts from real users of Stable Diffusion. Their analyses result in a taxonomy of specifiers found in the prompts and clusters to find macro‑categories, along with a classifier trained to follow it. An extension of this work comes from McCormack et al. (2024), who include data from Midjourney and provide some additional clustering and topic modeling. Both papers make use of the HDBSCAN clustering algorithm in conjunction with UMAP dimensionality reduction.

2.2 Lyrics analysis

In MIR tasks such as mood and genre classification, lyrics and other textual features have been analyzed using NLP techniques (Mayer and Rauber, 2010, 2011). In the context of genre classification, Logan et al. (2004) demonstrate how to measure song similarity using NLP techniques applied to lyrics. Fell and Sporleder (2014) use a number of specialized linguistic features to address genre classification, rating prediction, and publication time prediction. Pyrovolakis et al. (2022) use a mix of NLP models and audio‑signal processing techniques to create song mood classifiers. Similar techniques can also be used to investigate sociocultural questions, moving beyond classification and retrieval tasks. For example, Varnum et al. (2021) looked at the decreasing linguistic complexity of lyrics in six decades of popular American music, using compressibility as a proxy, and studied correlation to ecological, social, and cultural factors in order to explain it. Napier and Shamir (2018) apply sentiment analysis techniques to lyrics of popular music since the 1950s and show how certain feelings like anger and disgust have increased significantly, while feelings like joy and openness have declined. In their analysis of lyrical complexity, structure, emotion, and popularity across five decades, Parada‑Cabaleiro et al. (2024) similarly demonstrate an increase in negative emotions, as well as a decrease in both lexical and structural complexity over time.

3 The Interaction Modality of Text‑to‑Music Generation Systems

We now provide a quick description of how users of Suno and Udio create music using text. Figure 1 shows the music‑creation interface in Suno, where a user can click on ‘Create’ in the sidebar to generate a song. This will show a textbox for prompting the model. Suno will generate the lyrics and style tags according to the prompt, which is used to condition the model. Users can also specify with a radio button if they prefer an instrumental track. An additional radio button labeled ‘Custom’ can be toggled, offering direct control over lyrics and style prompts. While the interface suggests adding tags that are added as comma‑separated elements, the textbox also accepts free text.

Figure 1

Suno’s interface in ‘Custom’ mode. The user can manually input lyrics and tags (style of the music). Last accessed December 16, 2024.

Figure 2 shows an example of Udio’s interface. The interaction happens largely in the same way as that with Suno. The user is first asked to describe the song they want, and Udio suggests tags. There is a button to fill this part with a randomly generated prompt. After that, lyrics can be written manually, automatically, or skipped to generate an instrumental song. The interface suggests pressing the ‘/’ key to show a list of popular meta‑tags that are added to the lyrics between square brackets and allow for additional control.

Figure 2

Udio’s interface (as of February 18, 2025). Custom mode allows for manual adjustments even if the prompts are automatically generated.

The text fields in the interfaces accept any Unicode input, including emojis and non‑Latin characters. At the time of writing, new functionalities are being added to the interfaces of both platforms (e.g., audio prompting, new models), but we do not consider any additions beyond October 2024.

4 Methodology

We now describe our methodology, drawing on similar work investigating users of AI image–generation platforms (McCormack et al., 2024; Sanchez, 2023). Our method can be summarized in six steps.

1. We scrape the regularly updated playlists from Suno and Udio’s ‘new songs’ public‑facing homepages.
2. We extract the metadata (prompt, lyrics, and tags) from these playlists and cast it as a dataframe consisting of rows (songs) and columns (metadata). This allows us to filter the rows based on content (e.g., language, length, null values).

We perform separate analyses on prompts, tags, and lyrics. For any of those data types:

3. We embed it using a pretrained model such as NV‑Embed‑v2 (Lee et al., 2024).
4. We perform dimensionality reduction and clustering on the embeddings.
5. For each of the clusters, we derive a label using an automated procedure and manually refine.
6. We create a two‑dimensional scatterplot of the clustered elements for visual analysis.

We iterate steps three to six multiple times, experimenting with the hyperparameters for dimensionality reduction and clustering.

In more detail, we scrape the platforms’ ‘new songs’ playlists, extracting metadata from the HTTP request we find on each platform’s homepage. These playlists represent the most systematic and least‑biased proxy for randomly sampling the data on the platforms. While no information is available on how these playlists are populated, we assume they are periodically created with minimal supervision, if any, and users do not opt in explicitly. We save the metadata in JSON format for each song in a file that is refined into a dataframe for each platform. Some of the analyses we perform require filtering the dataset. We remove some entries only in those specific cases. Specifically, this means ignoring languages other than English from both the lyrics and prompt subsets, removing entries that are too short or too long, and dropping empty rows from the dataframe. We use a language classification model to filter out non‑English elements because they could prove problematic for English‑tuned NLP tools. We choose the classification model included in the fastText library from Meta.^⁴ The model outputs labels for languages according to the ISO 639‑3 standard,^⁵ together with an indication of the alphabet used.
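The filtering step can be sketched as follows. This is a minimal illustration rather than the code used in the study: it assumes each row already carries a predicted language label in the fastText style (a `__label__` prefix followed by an ISO 639‑3 code and a script, e.g., `__label__eng_Latn`), and the field names `text` and `lang_label`, as well as the length bounds, are hypothetical.

```python
LABEL_PREFIX = "__label__"  # assumed prefix on fastText-style predicted labels


def parse_lang_label(label):
    """Split a label such as '__label__eng_Latn' into ('eng', 'Latn')."""
    code = label[len(LABEL_PREFIX):] if label.startswith(LABEL_PREFIX) else label
    lang, _, script = code.partition("_")
    return lang, script


def filter_rows(rows, min_len, max_len, lang="eng"):
    """Keep non-empty rows in the target language whose text length is within bounds."""
    kept = []
    for row in rows:
        text = (row.get("text") or "").strip()  # 'text' is a hypothetical field name
        if not text:
            continue  # drop empty rows
        if not (min_len <= len(text) <= max_len):
            continue  # drop entries that are too short or too long
        if parse_lang_label(row.get("lang_label", ""))[0] != lang:
            continue  # drop languages other than the target
        kept.append(row)
    return kept
```

Keeping the label parsing separate from the row filter makes it easy to reuse the parsed language and script when tabulating language prevalence.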

To generate embeddings for clustering, we use NV‑Embed‑v2 by NVIDIA (Lee et al., 2024), a state‑of‑the‑art deep learning model specifically designed for textual embedding. This model uses decoder‑only transformers along with a modified attention mechanism and ad‑hoc contrastive learning pre‑training to learn text embeddings, outperforming previous methods based on masked language models. We employ the implementation made available by the authors through the Huggingface library.^⁶

The latent representation produced by NV‑Embed has a dimensionality of $4096$ . We use dimensionality reduction as an intermediate step before clustering. Our choice for this task is UMAP (McInnes et al., 2018), which works by building a high‑dimensional graph with weighted connections that is reduced to a smaller‑dimensional graph while trying to preserve its structure. We employ the Python implementation provided by the umap‑learn package.^⁷ UMAP has a number of important parameters that influence the final result. The number of approximate nearest neighbors when building the high‑dimensional graph regulates the trade‑off between global and local structure, and this in turn influences the outcomes of clustering in our pipeline. The minimum distance between points in the reduced dimensional space directly affects how closely the algorithm packs together similar points. At clustering time, this can lead to too many clusters if set too low, while, at visualization time, it can result in small clusters being too far away from the rest, making the plot difficult to read.

For clustering, we use HDBSCAN (Campello et al., 2013), an extension of the DBSCAN algorithm (Ester et al., 1996) that enables robust identification of clusters with varying densities. This clustering algorithm is recommended by McInnes et al. (2018), in conjunction with UMAP, when operating in high‑dimensional embedding spaces. One benefit of HDBSCAN compared to other clustering algorithms is the automatic detection of outliers. We use the implementation in the scikit‑learn package.^⁸ The algorithm has hyper‑parameters for the minimum and maximum cluster sizes, as well as a factor $ϵ$ that controls how likely clusters are to be merged together. We set these values for each analysis through trial and error, aiming for the combination of hyper‑parameters that gives a meaningful number of clusters.
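The reduction and clustering steps can be sketched as below, assuming the umap‑learn and scikit‑learn packages are installed. The function names are illustrative, and we assume here that the factor $ϵ$ corresponds to scikit‑learn's `cluster_selection_epsilon` argument; the parameter values are left to the caller because they differ between analyses.

```python
from collections import Counter


def reduce_and_cluster(embeddings, n_components, n_neighbors, min_dist,
                       min_cluster_size, max_cluster_size=None, epsilon=0.0,
                       metric="euclidean", random_state=0):
    """Reduce embeddings with UMAP, then cluster the reduced points with HDBSCAN."""
    # Imported lazily so that summarize_labels below works without these packages.
    import umap
    from sklearn.cluster import HDBSCAN

    reducer = umap.UMAP(n_components=n_components, n_neighbors=n_neighbors,
                        min_dist=min_dist, metric=metric,
                        random_state=random_state)
    reduced = reducer.fit_transform(embeddings)

    clusterer = HDBSCAN(min_cluster_size=min_cluster_size,
                        max_cluster_size=max_cluster_size,
                        cluster_selection_epsilon=epsilon)  # assumed mapping of epsilon
    return clusterer.fit_predict(reduced)  # label -1 marks outliers


def summarize_labels(labels):
    """Return (number of clusters, number of outliers) from HDBSCAN labels."""
    counts = Counter(int(label) for label in labels)
    n_outliers = counts.pop(-1, 0)
    return len(counts), n_outliers
```

Reporting the number of clusters and outliers after each run is a simple way to compare hyper‑parameter settings during the trial‑and‑error search.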

To facilitate exploration, especially in the case of lyrics with a high number of clusters, we devise a strategy to automatically name them. After forming clusters of tags, we use a bag‑of‑words strategy by counting term prevalence and then select the $N$ most common ones. We find that $N = 3$ is enough to topically characterize clusters. For lyrics and prompts, we remove stop‑words, perform lemmatization, and finally select the top three words after TF‑IDF ranking using the spacy package. We manually inspect clusters and then possibly rename each with more descriptive labels. We find, however, that the top word often provides a good characterization of the cluster. Additionally, we assign tags to a set of macro‑categories by either using a list of genres or instruments or by running regular expressions on specific technical terms (e.g., BPM). This influences dimensionality reduction and clustering, making the final clusters more informative and reflective of higher‑level categories.
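A minimal sketch of the two naming strategies follows. The text does not specify how TF‑IDF is computed over clusters, so this example assumes each cluster is treated as a single document and uses a smoothed inverse document frequency; tokens are assumed to be already lemmatized and free of stop‑words, and all function names are hypothetical.

```python
import math
from collections import Counter


def top_tags(cluster_tags, n=3):
    """Bag-of-words naming: the n most frequent tags in one cluster."""
    return [tag for tag, _ in Counter(cluster_tags).most_common(n)]


def tfidf_top_terms(clusters, n=3):
    """Name each cluster by its n highest-scoring terms.

    clusters maps a cluster id to the list of tokens of all its texts.
    Assumption: one cluster = one document; smoothed IDF.
    """
    n_clusters = len(clusters)
    doc_freq = Counter()
    for tokens in clusters.values():
        doc_freq.update(set(tokens))  # count clusters containing each term

    names = {}
    for cluster_id, tokens in clusters.items():
        counts = Counter(tokens)
        total = sum(counts.values())
        scores = {
            term: (count / total)
            * (math.log((1 + n_clusters) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        names[cluster_id] = [term for term, _ in ranked[:n]]
    return names
```

Terms that occur in every cluster, such as ‘love’ in a lyrics corpus, are down‑weighted, so the names tend to reflect what distinguishes a cluster from the others.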

5 Dataset

Our dataset consists of a total of $101,953$ songs ($20,519$ from Udio and $81,434$ from Suno) generated by $60,342$ users that we collected between May and October 2024.

Table 1 summarizes the columns present in the metadata JSON object we have relevant for our research questions.

Table 1

The subset of the metadata from Suno and Udio we analyzed.

Field	Suno	Udio
`id`	Unique song ID	Unique song ID
`title`	Title of the song	Title of the song
`tags`	Tags derived from the user input	User tags
`replaced_tags`	N/A	Dictionary with tags replacements
`lyrics`	N/A	Song lyrics
`prompt`	Song lyrics	User input
`gpt_description_prompt`	User prompt for lyrics generation	N/A
`optimized_prompt`	N/A	Refined user input

There are inconsistencies around the actual meaning of prompt and tags due to differences between the two platforms. In the context of Suno, ‘prompt’ means any text input in the lyrics textbox before having it generate a song (Figure 1). This text can be written by a user or generated by an LLM prompted with a user’s description, stored as gpt_description_prompt. In Suno, what gets labeled as tags in the metadata comes from a textbox labeled in the interface as Style of Music, with a popover that invites the user to: ‘Describe the style of music you want (e.g., acoustic pop). Suno’s models do not recognize artists’ names but do understand genres and vibes.’ Although suggestions are provided, resulting in comma‑separated keywords if clicked, the string does not appear to be post‑processed and often results in individual tags. The way that these are stored is reflected in the interface too. Udio offers a textbox for a general prompt, which is then translated into tags by the system, unless Manual Mode is enabled. Additionally, a number of suggested tags is provided, which updates as the user selects them or writes in the textbox. The post‑processing of tags is most notable when the name of a real artist is mentioned; these get replaced with generic tags, as opposed to Suno, which just blocks their use. The metadata contains the original prompt and tags, as well as their post‑processed version, if any. The interface has a section for lyrics, which can be either AI‑ or user‑generated. In both cases, they are stored as lyrics without any indication of their provenance. The optimized prompt field for Udio is used inconsistently. When present, it seems to overlap with the replaced tags field.

In the context of this paper, we use the following terminology in order to clear the ambiguity around certain fields of our metadata:

lyrics: prompt in the case of Suno and lyrics in the case of Udio;
tags: tags and their replaced/optimized version in the case of Udio and high‑level labels for each song; and
prompt: gpt_description_prompt in the case of Suno and prompt in the case of Udio.
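This terminology can be expressed as a small normalization function. The sketch below uses the field names from Table 1; the function name and the platform identifiers `"suno"` and `"udio"` are assumptions made for illustration.

```python
def normalize_record(raw, platform):
    """Map platform-specific metadata fields onto the shared terminology."""
    if platform == "suno":
        return {
            "id": raw.get("id"),
            "lyrics": raw.get("prompt"),  # Suno stores lyrics under 'prompt'
            "tags": raw.get("tags"),
            "prompt": raw.get("gpt_description_prompt"),
        }
    if platform == "udio":
        return {
            "id": raw.get("id"),
            "lyrics": raw.get("lyrics"),
            "tags": raw.get("tags"),
            "replaced_tags": raw.get("replaced_tags"),  # post-processed tags, if any
            "prompt": raw.get("prompt"),
        }
    raise ValueError(f"unknown platform: {platform!r}")
```

Normalizing early means that later analysis steps can refer to `lyrics`, `tags`, and `prompt` without checking which platform a song comes from.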

6 Results

We now discuss the specifics of several analyses and our results.

6.1 Language detection

Table 2 shows the $15$ most prevalent languages in lyrics on both platforms, accounting for more than 90% of the dataset. For each language, prevalence in each platform is given as a percentage, with its rank depending on the sum across platforms. For both services, we see lyrics written in a variety of languages, with English being by far the most prevalent (this is expected, as both companies are based in the United States). The tail of the distribution features major European and Asian languages. These findings hold for prompts as well. The differences in the use of certain languages might be explained by different release patterns on the global market and Udio’s smaller user‑base, likely due to its more recent establishment.

Table 2

The 15 most popular languages for lyrics in the dataset. The percentage refers to the prevalence in each platform.

Language	ISO 639‑3	Udio	Suno
English	`eng_Latn`	71.39%	46.75%
German	`deu_Latn`	3.68%	8.87%
Russian	`rus_Cyrl`	2.99%	6.68%
Spanish	`spa_Latn`	3.28%	4.58%
Portuguese	`por_Latn`	1.68%	3.55%
Korean	`kor_Hang`	3.21%	3.00%
Chinese	`yue_Hant`	1.77%	3.33%
Indonesian	`ind_Latn`	0.27%	3.26%
French	`fra_Latn`	1.81%	2.15%
Japanese	`jpn_Jpan`	1.45%	1.92%
Turkish	`tur_Latn`	0.29%	1.66%
Italian	`ita_Latn`	1.06%	1.29%
Thai	`tha_Thai`	0.05%	1.26%
Vietnamese	`vie_Latn`	0.09%	1.11%
Polish	`pol_Latn`	0.77%	0.94%
TOTAL		93.79%	90.35%

6.2 Prompts

In our dataset of $101,953$ songs, we find $41,108$ that have a prompt: $20,589$ from Suno and $20,519$ from Udio. The median length is around $80$ characters, but, for Udio, there is a heavier tail of long prompts pushing the $99^{th}$ percentile to $1,325$ characters. Using this number as a threshold, we remove $206$ prompts based on their length. To avoid distorting the results because of different languages and repeated prompts, we filter out any text in languages other than English and remove duplicates. This results in a subset of $16,881$ prompts.
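The length threshold and de‑duplication can be sketched as follows. The text does not say which percentile definition is used, so this example assumes the nearest‑rank method, and it treats only exact string matches as duplicates; both function names are hypothetical.

```python
def percentile_nearest_rank(values, p):
    """Nearest-rank percentile for an integer p in 1..100 (assumed definition)."""
    ordered = sorted(values)
    rank = max(1, (p * len(ordered) + 99) // 100)  # integer ceil(p * n / 100)
    return ordered[rank - 1]


def drop_long_and_duplicates(texts, p=99):
    """Remove texts longer than the p-th length percentile, then exact duplicates."""
    threshold = percentile_nearest_rank([len(t) for t in texts], p)
    within = [t for t in texts if len(t) <= threshold]
    return list(dict.fromkeys(within))  # keeps the first occurrence, preserves order
```

Using `dict.fromkeys` keeps the first occurrence of each prompt and preserves the original order, which is convenient when rows must later be matched back to their metadata.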

We use NV‑Embed‑v2 to embed the prompts and reduce the resulting embeddings to five dimensions using UMAP with n_neighbors $= 10$ and min_dist $= 0.15$. We run HDBSCAN with a minimum cluster size of $20$ and a maximum of $200$, with $ϵ = 0.25$, resulting in $81$ clusters. Figure 3 shows the resulting plot. For each cluster in the plot, its label was manually reviewed and revised based on the procedure described in Section 4.

Figure 3

Clustering of prompt embeddings. The names for each cluster are manually defined after checking their content.

Clusters form around shared keywords and semantic meaning, mostly building around genre‑ or instrument‑specific prompts. We also see a number of clusters built on themes like ‘pets,’ ‘coffee,’ ‘summer,’ or ‘stand‑up comedy.’ A few clusters have names between quotes; these feature prompts where some details might change but the part between quotes is always present. For example, ‘An xxx song about when you’re not around.’

An interesting group is made of prompts that appear to be generated automatically, possibly by some external service.^⁹ We did not include these in Figure 3, as their high degree of similarity results in a skewed plot, but they are labeled ‘scripted’ in the interactive visualization.^¹⁰ As the name suggests, these prompts follow a pattern that features a theme, a name the song is dedicated to, some adjectives, and a genre. As a confirmation of their scripted nature, the prompts feature a string between curly brackets that was likely not replaced by a value, e.g., ‘I’m currently feeling hyped and I want to feel happy. My favorite genre of music is country. Please write the song in {moodsGenre} style.’

If we cluster prompts without removing duplicates, we also obtain a number of clusters containing only copies of the same prompt, which shows that some users publish multiple attempts at generating the same song. We can speculate that most of the generations on the platform are of this nature but, since they remain unpublished, we cannot confirm this.

Beyond semantic content, it appears that there are two main tendencies in the way prompts are built. On the one hand, we find prompts constructed like a list of comma‑separated qualifiers, with their length varying from a couple of words to more than 10 (e.g., ‘modern country,’ ‘contemporary folk,’ ‘introspective,’ ‘melodic,’ ‘bittersweet’). On the other, we have more literary prompts that describe the content of the song (e.g., ‘A song about . . . ’) or the sound of the song (e.g., ‘A jazz ballad with trumpet, etc., . . . ’). We can view these two as extremes on a spectrum, with most prompts being a combination of both approaches.

Prompts might sometimes contain references to real artists, with text like ‘A song in the style of . . . ’ or ‘Sung by . . . ’ It appears, however, that Suno uses the prompt only for generating the lyrics, while tags do not allow the use of real artists’ names. Udio is instead more flexible, extracting tags from the prompt automatically and, in many cases, allowing artists’ names. Section 6.5 contains a list of artists we find in the Udio subset. Overall, it appears that prompts contain a mix of information about themes and style. The former emerges more clearly from analyzing lyrics, while the latter is better observed in the tags.

6.3 Lyrics

In our dataset, we find that $11.1\%$ (Udio) and $15.6\%$ (Suno) of songs are instrumental, and thus have no or few lyrics. We filter out all lyrics that are not detected as English. The distribution of lyrics length shows that the $99^{th}$ percentile is $3,460$ characters for Udio and $2,999$ for Suno. We thus remove from consideration English‑detected lyrics longer than $3,600$ characters and instrumental entries, giving a total of $42,163$ lyrics that we analyze.

We combine song lyrics and song titles into a single string for analysis. Since lyrics can include structural information inside square brackets, e.g., ‘[chorus],’ we remove all such occurrences using a regular expression. Following McCormack et al. (2024), we reduce the embeddings of the lyrics to five dimensions using umap‑learn with the following parameters: metric = cosine, min_dist = 0.1, and n_neighbors = 15. We then apply HDBSCAN in the reduced latent space. By iteratively inspecting clusters, we settled on the following hyperparameters: min_cluster_size = 20, min_samples = None, $ϵ = 0.0$, and max_cluster_size = None. The algorithm gives us $190$ clusters, as illustrated by Table 3, and a group of outliers of size $25,794$. This high number of outliers is not problematic since our goal is not to classify all the elements perfectly but to find meaningful anchors in the latent space with which to infer what each cluster represents.
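The preprocessing of lyrics before embedding can be sketched as follows. The exact regular expression and the way title and lyrics are joined are not given in the text, so the pattern, the newline separator, and the function name below are assumptions.

```python
import re

# Matches any non-nested span in square brackets, e.g. '[chorus]' or '[Verse 1]'.
METATAG = re.compile(r"\[[^\[\]]*\]")


def build_lyrics_text(title, lyrics):
    """Strip bracketed structural markers and prepend the song title."""
    cleaned = METATAG.sub("", lyrics)
    # Drop the blank lines left behind by removed markers.
    lines = [line.strip() for line in cleaned.splitlines()]
    body = "\n".join(line for line in lines if line)
    return f"{title}\n{body}".strip()
```

Removing the markers before embedding keeps structural labels such as ‘[chorus]’ from dominating the similarity between songs that are otherwise thematically unrelated.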

Table 3

Macro‑categories (manually defined) and HDBSCAN clusters (manually renamed) in the lyrics‑embedding space with their respective sizes.

Category	Clusters
abstract (1719)	afrofuturism (26), afrofuturism (64), carpe diem (70), chaos (21), clashing (60), dreams (49), dreams (31), flying (199), spirituality (42), mask (24), mirrors (53), money (26), photo (35), post‑atomic (32), religion (854), shadows (48), tired (37), war (48)
animals (708)	animals (74), bees (38), birds (40), butterfly (23), capybara (33), cats (280), dogs (169), fireflies (27), frog (24)
celebration (232)	birthday (126), halloween (31), xmas (75)
daily life (345)	chores (23), daily work (68), monday (54), monotony (25), rent (37), school (109), weekend (29)
dance (515)	beat (31), beat (64), last (49), night (31), night (125), groove (125), heartbeat (29), moonlight (25), swing (36)
driving (210)	driving (83), road trip (47), speed (80)
family (241)	family (52), father (43), friends (87), mother (59)
fantasy (893)	demons (21), fantasy (562), shadows (24), spooky (117), vampire (29), vikings (83), werewolf (57)
feelings (809)	break free (109), fade away (26), good ol days (31), happiness (24), loneliness (131), madness (116), old place (46), pain (30), peaceful (28), revenge (23), run free (25), runaway (29), tears (22), weariness (114), yesterday (55)
food (784)	candy (60), cheese (522), coffee (101), fruit (101)
genre (817)	blues (60), country music (301), emo (195), guitar (52), heavy metal (29), rock ‘n’ roll (50), trap‑like (130)
location (679)	america (38), australia (21), beach (28), capitals (23), desert (58), earth (96), egypt (35), forest (65), river (26), sea (289)
love (2182)	apology (21), goodbye (85), heartbreak (172), i miss you (45), always (74), breakup (81), burning (37), can’t wait (52), crazy (29), distance (50), dream (31), electric (40), eyes (97), feel (41), forever (26), forever (406), hand (20), holding on (33), letting go (22), longing (80), loss (28), missing (27), stay (37), time (26), togetherness (41), unrequited (292), wait (33), wandering (38), whisper (76), whisper (47), return (95)
meme (412)	fck (183), memes (91), pp (118), weed (20)
mixed language (769)	chinese (93), hindi (23), indonesian (24), jamaican (86), japanese (355), korean (108), russian (52), spanish (28)
motivational (288)	new dawn (50), dreams (40), phoenix (59), rising (61), shine (21), unstoppable (57)
other (557)	alphabet (27), boots‑pub? (27), gpt glitch? (25), absurd? (25), poe‑raven (39), short+instr. (414)
politics (143)	palestine (29), protest songs (71), trump+biden (43)
sports (254)	sports (197), training (57)
stars/night (648)	cosmic (460), quiet night (39), stars (32), nightsky (117)
technology (899)	AI (663), code (110), crypto (40), digital (55), math (31)
time (336)	midnight (32), midnight (130), midnight+love (22), morning (38), sunset (84), time (30)
urban (595)	city (135), city (165), lost (22), neons (233), street (40)
videogames (126)	helldivers (20), pokemon (42), videogames (64)
weather/seasons (949)	autumn (51), frozen (39), moonlight (140), rain (62), rain+dancing (183), rain+love (49), rainy day (39), summer (211), sunshine (155), sunshine+love (20)
outliers (25794)	outliers (25794)

We manually refine the auto‑generated cluster names by listening to the songs closer to the centroid of each cluster and renaming the cluster in cases where we find the original name imprecise or ambiguous. We then manually group clusters into macro‑categories based on semantic similarities. We obtain $26$ macro‑categories, plus one for the outliers. Table 3 summarizes this result and Figure 4 shows the bi‑dimensional reduction of the embedding space, where each song is a dot colored according to the cluster whose name is printed near the centroid.

Figure 4

Clusters of lyrics (from both Suno and Udio) obtained from the HDBSCAN algorithm applied to a five‑dimensional UMAP reduction of the embedding space. The scatterplot is then created on a two‑dimensional reduction of the same space. Colors represent macro‑categories and text annotations refer to the specific clusters, as shown in Table 3.

We see certain types of use and users emerge from the clusters. As one might expect, the biggest group is songs with variations around the theme of love, with many smaller clusters that capture different aspects like unrequited love, breakups, distance, longing, etc. The biggest individual cluster, however, is the one with $854$ worship songs, where ‘love,’ ‘god,’ and ‘lord’ are the most prevalent words. In this cluster, Christianity and Islam seem to be the dominant subjects. There is a group of songs that are created to celebrate specific events like holidays or family members. This suggests that users sometimes use these services to generate bespoke songs for particular people on celebratory occasions, in relation to certain experiences like trips or sporting events, or covering daily routines like school life. Songs about pets, animals, and food are also a recognizable category, reflecting a ‘musicalization of everyday life,’ as noted by Tan (2024). Contemporary geopolitical events are also seen. We find a cluster of political songs about the 2024 USA elections and candidates. Clusters for the Israeli–Palestinian conflict are also present, but, in the plot, we only see a cluster with Palestine‑related keywords. Songs with Israel‑related keywords are not absent from the dataset; rather, they are mostly written in Hebrew, are labeled as such by the language detection, and thus are not considered in our analyses. Not all traces of languages other than English are filtered out, however. A subset of users seem to be creating songs using bilingual lyrics. This is especially true for Asian languages, perhaps because of the popularity of J‑Pop and K‑Pop. Themes from fiction like high fantasy, horror and monsters, and post‑atomic worlds all appear with their clusters of songs, with the first being the largest. There are multiple small clusters about video games, with classics like Pokemon and recent releases like Helldivers. We find a thematic cluster of songs about technology. Especially considering AI, there seems to be a similar number of songs in praise of it and against it. In the meme category, we find clusters of songs featuring humorous, sometimes crass, lyrics or that feature words from popular online memes (e.g., ‘skibidi,’ ‘sigma,’ ‘Ohio’). In the ‘other’ group, we find some clusters that are peculiar and some that seem to be artifacts of the clustering algorithm (highlighted by a question mark in Table 3). In the former group, we have songs based on Poe’s poem The Raven, which appear to be related to a community‑organized challenge on the Suno Discord Server.^¹¹ An example of the latter is in the cluster labeled ‘gpt glitch,’ featuring lyrics that look like the output of a large language model, containing text such as ‘I’m sorry but I can’t create that for you.’

Manually inspecting the outlier class shows that most outliers are spread among existing clusters and could reasonably be merged into them. For example, most outliers around the fantasy and war, feelings, and dance clusters share similar semantics. Songs with multiple themes seem to be labeled as outliers by HDBSCAN unless their prevalence is considerable, like the different clusters in the ‘love’ macro‑class. Proximity can help discover distinct groups among the outliers. For example, next to the other mixed‑language clusters, there are undetected Arabic and Thai songs, as well as some featuring either ASCII art or emojis in the lyrics. On the left side, there is a very dense group of songs with the same lyrics, featuring the line ‘love is electric,’ mostly created by the same user (found with manual inspection).

6.4 Tags

Tags contain genre and style conditioning for the generation model and can provide some interesting insight into what is popular on the platforms, especially since they are often used as searchable keywords that help users explore the catalog. Upon inspection, tags from Suno appear mostly as free text, while Udio’s tags appear mostly as comma‑separated values. We hypothesize that this difference in the metadata field might be due to different practices that arise in each user community, but it might also depend on the way tags are collected and processed, as well as on the interface of each service, with Udio giving more prominence to clickable recommendations than Suno. Overall, Suno’s tag strings tend to be shorter than a stringified version of Udio’s tag lists, with average lengths of $43.37 \pm 32.27$ and $115.63 \pm 108.20$, respectively.

For this analysis, we look at metadata labeled as tags in both platforms without attempting to reverse‑engineer the way they are collected or created. We start by building a vocabulary of tags. In order to isolate elements in Suno’s tag strings, we start by compiling a list of tags from Udio’s comma‑separated lists, obtaining $7656$ tags. Then, we check Suno’s data against the vocabulary. Rows that look like lists are split and combined with the vocabulary. The rest of the data are parsed for matches with the vocabulary, preferring matches with longer tags and skipping elements with no matches. This results in $18695$ tags, of which $24.32\%$ appear fewer than 10 times. While we count $24058$ unique tags, $76.02\%$ of them only appear once, as they are either very uncommon words or very long descriptive strings. Figure 5 shows a word cloud with the $200$ most common tags for each platform. From this point onward, we limit our analysis to tags that appear at least 10 times in the dataset, resulting in $1245$ tags. We avoid reducing the strings to word stems to preserve individual tags in the comma‑separated lists and avoid losing information from named entities like genre names.
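The vocabulary construction and longest‑match parsing described above can be sketched as follows. This is a minimal illustration, not the code used for the paper: the function names, the lowercasing, the whitespace tokenization, and the `max_words` window are all assumptions made for the example.

```python
from collections import Counter


def build_vocabulary(udio_tag_lists):
    """Collect normalized tags from comma-separated tag strings."""
    vocab = set()
    for row in udio_tag_lists:
        for tag in row.split(","):
            tag = tag.strip().lower()
            if tag:
                vocab.add(tag)
    return vocab


def parse_free_text(text, vocab, max_words=4):
    """Greedy longest-match segmentation of a free-text tag string.

    At each position the longest window of words is tried first;
    words that match nothing in the vocabulary are skipped.
    `max_words` is a hypothetical cap on tag length, in words.
    """
    words = text.lower().replace(",", " ").split()
    found, i = [], 0
    while i < len(words):
        for n in range(min(max_words, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in vocab:
                found.append(candidate)
                i += n
                break
        else:
            i += 1  # no tag starts at this word: skip it
    return found


# Toy data, invented for illustration only.
vocab = build_vocabulary(["indie rock, female vocalist, dreamy", "rock, synthpop"])
tags = parse_free_text("dreamy indie rock with female vocalist", vocab)
counts = Counter(tags)
```

In this toy example, ‘indie rock’ is matched as a single tag rather than as ‘rock,’ which is the effect of preferring longer matches, and the word ‘with’ is skipped because it matches nothing in the vocabulary.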

Word cloud for Suno **(left)** and Udio **(right)**. Font size is scaled according to prevalence.

Clustering the tags directly results in groups that are not very informative, as semantic relationships seem to overpower functional ones. For example, guitar, rock, and aggressive would gravitate toward each other rather than be closer to instruments, genres, and adjectives, respectively. To address this, we create a high‑level taxonomy to guide the dimensionality reduction and influence the subsequent clustering. The first two columns in Table 4 show the category names and their prevalence. We match genre names to those given by Every Noise at Once.^¹² We do the same thing for musical instruments using the list from the Institute of Musical Instrument Technology website.^¹³ Using a part‑of‑speech tagger, we find single‑word adjectives in the remaining tags. The remainder of the set contains the other classes that we spot in the first clustering attempt, plus some outliers we sort out manually. We match voice tags to anything containing words like ‘voice,’ ‘vocal,’ or ‘singer.’ Similarly, structural tags are all those that contain keywords related to arrangement (e.g., ‘chorus,’ ‘intro,’ ‘fast,’ ‘instrumental’). ‘BPM,’ years (e.g., ‘60s’), and key signatures (e.g., ‘D flat major’) are all matched using simple regular expressions. We refine these classes manually after checking elements that remain unclassified and the clustering results. We end up with $85$ clusters, mostly following the macro‑categories we specify manually. Figure 6 shows the results of our clustering pipeline, where UMAP is conditioned on our high‑level taxonomy. The five‑dimensional UMAP uses n_neighbors $= 10$ and min_dist $= 0.05$. HDBSCAN uses minimum and maximum cluster sizes of $5$ and $25$ and $ϵ = 0.15$. The two‑dimensional UMAP uses n_neighbors $= 50$ and min_dist $= 0.1$.
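The rule‑based part of this taxonomy can be sketched with a few keyword lists and regular expressions. The patterns, the order in which the rules are applied, and the function name below are assumptions made for illustration; the genre and instrument sets stand in for the external lists, and the part‑of‑speech step for adjectives is omitted.

```python
import re

# Hypothetical patterns: the text only says "simple regular expressions".
BPM_RE = re.compile(r"^\d{2,3}\s*bpm$")
YEAR_RE = re.compile(r"^(?:\d{2}|\d{4})s$")
KEY_RE = re.compile(r"^[a-g](?:\s*(?:flat|sharp)|[#b])?\s+(?:major|minor)$")

# Keywords quoted in the text above.
VOICE_WORDS = ("voice", "vocal", "singer")
STRUCTURE_WORDS = ("chorus", "intro", "fast", "instrumental")


def classify(tag, genres=frozenset(), instruments=frozenset()):
    """Assign a tag to a macro-category using ordered rules."""
    t = tag.strip().lower()
    if t in genres:
        return "GENRE/STYLE"
    if t in instruments:
        return "INSTRUMENT"
    if BPM_RE.match(t):
        return "BPM"
    if YEAR_RE.match(t):
        return "YEAR"
    if KEY_RE.match(t):
        return "KEY"
    if any(word in t for word in VOICE_WORDS):
        return "VOICE"
    if any(word in t for word in STRUCTURE_WORDS):
        return "STRUCTURE"
    # Adjectives (qualifiers) would need a part-of-speech tagger,
    # which this sketch leaves out.
    return "UNCLASSIFIED"


labels = {tag: classify(tag, genres={"rock"}, instruments={"guitar"})
          for tag in ["rock", "guitar", "120 BPM", "60s",
                      "D flat major", "female vocalist", "instrumental intro"]}
```

Substring matching of this kind is coarse: a tag such as ‘introspective’ would be caught by the ‘intro’ keyword, which illustrates why a manual refinement pass is needed after the automatic assignment.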

Table 4

Manually created high‑level taxonomy derived from tags used at least 10 times. For Suno and Udio separately, we indicate the expected number of tags $λ$ and the probability of seeing at least one in a string according to a fitted Poisson distribution. UNDEFINED refers to tags that appear in a string but do not match our vocabulary.

Category	n. of tags	$λ_{suno}$	$λ_{udio}$	$P_{suno}(X \geq 1)$	$P_{udio}(X \geq 1)$
GENRE/STYLE	657	1.15e+00	3.90e+00	6.84e-01	9.80e-01
QUALIFIER	324	8.38e-01	3.56e+00	5.67e-01	9.72e-01
INSTRUMENT	108	2.54e-01	2.58e-01	2.24e-01	2.28e-01
STRUCTURE	68	1.34e-01	2.96e-01	1.26e-01	2.56e-01
VOICE	51	1.27e-01	6.08e-01	1.19e-01	4.55e-01
YEAR	22	2.98e-02	5.39e-02	2.93e-02	5.25e-02
KEY	10	3.65e-03	3.53e-03	3.64e-03	3.53e-03
BPM	6	1.63e-03	3.93e-04	1.63e-03	3.93e-04
UNDEFINED	-	1.26e+00	1.15e+00	7.17e-01	6.84e-01

Clusters of the most common tags (combined ranking from both services). Colors correspond to macro‑categories defined manually. Text corresponds to the most prevalent tag in each of the clusters we find with HDBSCAN. Gray circles indicate outliers.

Figure 6 shows that genres, qualifiers, and instruments make up most of the dataset and are clearly separated. Voice‑related tags also form two compact groups of notable size for male and female voices, respectively. We can see small, well‑defined clusters for year, BPM, and key specifiers. Structural tags are the most spread out, with some (e.g., beat) pushed toward clusters featuring related genres or concepts. Qualifiers are grouped into fewer and larger clusters, which appear to go from more abstract concepts (e.g., atmospheric) to more musical adjectives (e.g., melodic) that are closer to the genre clusters. Genres are more easily separated and seem to cluster according to the similarity of the associated styles. In some cases (e.g., country, hip‑hop), the cluster ends up very compact and separated, as each entry is a combination of the genre name and an adjective (black metal, thrash metal, etc.). Instruments cluster together in three separate areas, with the guitar cluster lying next to rock and country, the piano cluster being closer to jazz and instrumental genres, and the synth cluster being closer to electronic music. A number of elements were not clustered and are represented in gray in Figure 6. Most of these outliers are quite prevalent tags that nevertheless lie between many different genres and concepts in the embedding space and are geometrically difficult to place (e.g., acoustic).

Using the taxonomy, we can study the distribution of tag types in our collection for a typical user interaction. For each row of our dataframe, we count how many tags belong to each category and then fit a Poisson distribution. Columns 3–6 in Table 4 show the expected value of the distribution and the probability of seeing at least one tag of each category for Suno and Udio, respectively.
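The two quantities reported per category can be sketched as follows, assuming the fit is the usual maximum‑likelihood estimate, for which the Poisson rate $λ$ is simply the mean count per row. The probability of at least one tag is then $P(X \geq 1) = 1 - e^{-λ}$. The function names and the example counts are invented for illustration.

```python
import math


def poisson_rate(counts):
    """Maximum-likelihood estimate of the Poisson rate: the sample mean."""
    return sum(counts) / len(counts)


def prob_at_least_one(lam):
    """P(X >= 1) = 1 - P(X = 0) = 1 - exp(-lambda)."""
    return 1.0 - math.exp(-lam)


# Hypothetical number of genre tags found in eight tag strings.
counts = [1, 0, 2, 1, 3, 0, 1, 1]
lam = poisson_rate(counts)
p = prob_at_least_one(lam)
```

As a consistency check, `prob_at_least_one(1.15)` is approximately 0.683, which agrees with the 6.84e‑01 reported for genre tags on Suno up to rounding of $λ$.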

6.5 Real artist names in Udio

Udio replaces real artist names in the tags with a set of generic descriptors, which might be a way for the company to avoid producing outputs that infringe intellectual property while retaining the ability for a user to describe what music they want. In the metadata, the replaced_tags field is structured like a dictionary with labels for the nature of the replacement, making it easy for us to identify replaced artists. In our collection of $20,519$ Udio songs, we find $1000$ such instances corresponding to $703$ unique artists. Table 5 shows the $10$ most prevalent replaced artist names, along with the number of times they appear and an example of the tags substituted for them. Udio appears to match different spellings of the same name (e.g., ‘Bjork’ or ‘Björk’), and the replaced tags are not always the same, suggesting a dynamic system behind the replacement rather than a simple lookup table.

Table 5

The 10 most replaced artists found in Udio’s metadata under replaced_tags. The full list contains 703 artists.

Artist	#	Replaced Tags
XXXTentacion	26	emo rap, alternative r&b, hip hop, contemporary r&b, r&b, pop rap, aggressive, self‑hatred, boastful, depressive
Drake	19	male vocalist, pop rap, contemporary r&b, hip hop, r&b, alternative r&b, atmospheric, introspective, apathetic, mellow, bittersweet
Taylor Swift	18	alt‑pop, singer–songwriter, synthpop, nocturnal, romantic, love, atmospheric, lonely, sentimental, longing, concept album, lethargic, passionate, 2020s
Foo Fighters	18	male vocalist, alternative rock, post‑grunge, acoustic rock, energetic, melodic
The Beatles	17	male vocalist, psychedelic pop, pop rock, psychedelia, sunshine pop, art pop, melodic, lush, love, fantasy, optimistic, dense, pastoral
Depeche Mode	17	male vocalist, synthpop, downtempo, ambient pop, electronic, melancholic, melodic, calm, soothing, lush, mellow, nocturnal
Adele	17	female vocalist, pop soul, adult contemporary, pop, blue‑eyed soul, passionate, sad, sentimental
J. S. Bach	16	classical music, baroque music
The Weeknd	14	male vocalist, alternative r&b, electropop, r&b, electronic, synthpop, nu‑disco, party, hedonistic
ABBA	14	female vocalist, europop, euro‑disco, dance, pop, disco, optimistic, energetic, uplifting, melodic, rhythmic, party, lush

6.6 Metatags in the lyrics

Lyrics often contain elements that are not words to be sung by the model. We refer to these elements as metatags, following an online guide compiled by a Suno user: https://www.suno.wiki. Their purpose appears to be steering music generation by providing additional information about characteristics like structure, instrumentation, delivery, and dynamics. Metatags mostly appear between square brackets, although they might also appear without special markings (e.g., verse followed by a new line) or using other symbols like parentheses. We limit ourselves to looking for square brackets, as this seems to be the dominant pattern. Some users appear to put lyrical content like call and response phrases or second voices inside square brackets. These drop to the bottom of our list since they only appear once. Table 6 contains the $25$ most common metatags and their prevalence in each platform.

Table 6

Top $25$ most prevalent metatags. In cases where numbers appear, e.g., Chorus 2, they were stripped and merged into one category.

`Sequence`	Suno	Udio	Total
`verse`	73722	14548	88270
`chorus`	48670	16672	65342
`bridge`	17826	4665	22491
`outro`	6163	2386	8549
`pre‑chorus`	3725	2914	6639
`end`	3747	289	4036
`intro`	2436	1171	3607
`instrumental`	2682	710	3392
`drop`	662	1403	2065
`guitar solo`	1212	835	2047
`hook`	912	468	1380
`break`	943	212	1155
`interlude`	538	398	936
`fade out`	644	208	852
`instrumental break`	504	280	784
`solo`	537	236	773
`instrumental solo`	633	39	672
`instrumental intro`	596	64	660
`breakdown`	260	277	537
`refrain`	314	203	517
`instrumental interlude`	452	55	507
`yeah`	3	466	469
`pre‑hook`	426	15	441
`build`	163	267	430
`final chorus`	144	255	399

Most of the metatags we find are used to suggest structure, with verse and chorus dominating the list. Metatags for repeated parts often come with numbers, e.g., Chorus 2, but we opted to remove them and just consider the main metatag. Words specifying the instrumentation for a section, like guitar solo or instrumental intro, are very common.
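The extraction and merging of bracketed metatags can be sketched as follows. This is an illustrative reconstruction, not the code used for the paper: the pattern, the lowercasing, and the choice to strip only trailing numbers are assumptions.

```python
import re
from collections import Counter

# Text between square brackets, without nested brackets.
BRACKET_RE = re.compile(r"\[([^\[\]]+)\]")
# A trailing number such as the "2" in "Chorus 2".
TRAILING_NUMBER_RE = re.compile(r"\s*\d+$")


def extract_metatags(lyrics):
    """Return normalized metatags in order of appearance."""
    tags = []
    for raw in BRACKET_RE.findall(lyrics):
        tag = TRAILING_NUMBER_RE.sub("", raw.strip().lower())
        if tag:
            tags.append(tag)
    return tags


# Toy lyrics, invented for illustration only.
lyrics = "[Verse 1]\nHello\n[Chorus]\nLa la\n[Verse 2]\nBye\n[Chorus 2]\n[Guitar Solo]"
metatags = extract_metatags(lyrics)
counts = Counter(metatags)
```

Here ‘Verse 1’ and ‘Verse 2’ are both counted as `verse`, and ‘Chorus’ and ‘Chorus 2’ as `chorus`, mirroring the merging applied in Table 6.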

Some users go a step further by including multiple instructions in a single metatag. Looking at the least‑frequent elements in the subset, we find some unique and very elaborate metatags. A few examples of recurring elements are chord progressions, precise durations, and specific instructions on instrumentation, dynamics, or tempo. However, as noted on suno.wiki, long sequences of metatags can be ignored by the AI model or misinterpreted as lyrics and sung.

The heterogeneity in the way these sequences are used suggests this is an evolving practice, with users experimenting and trying to build shared knowledge of interacting with the systems. Udio also differs from Suno in that it actively encourages users to experiment by providing a tooltip with suggestions that a user can summon using the ‘/’ key. We speculate that Udio could exploit this behavior to improve the platform’s prompting capabilities by leveraging crowd‑sourced labels.

As we see with tags, metatags sometimes include the names of real artists. This is especially important in Suno, where providing such names in the tags is forbidden. Table 7 contains a few randomly selected examples where the name of an artist is used in an attempt to evoke their voice or music style.

Table 7

A few examples of metatags that mention real artists.

Artist	Metatag
Bob Marley	`produced by bob marley and lee perry`
Journey	`journey separate ways synth arpeggio`
Kanye West	`verse 2: kanye west`
Madonna	`influence: madonna, michael jackson`

7 Discussion

We now discuss our work and point out its limitations. The principal goal of this article is to present the first systematic analysis of the users of AI music–generation platforms through their prompts and lyrics. We focus our analysis on two specific popular platforms: Suno and Udio. We aim to understand how users are engaging with these text‑to‑music AI platforms, where text is a principal mode of interaction to specify prompts, lyrics, and tags. To this end, we collect a large dataset of metadata from these platforms and build a processing pipeline and a set of interactive visualizations that allow us to identify and characterize prevalent semantic groups of prompts, tags, and lyrics.

The resulting picture of these users is of course limited by the six‑month time frame of our data collection, namely May to October 2024. Certain trends might not be captured, and the prevalence of certain themes or tags can be due to their temporal context. For example, memes and geopolitical references shift continuously and sometimes very quickly. Another aspect that can be important for some users but is not captured by our dataset is the presence of different model versions for each service. There can be interesting differences in the way different model versions respond to prompting, which in turn will be reflected in different prompting practices. Finally, our dataset consists only of songs published by users, which inevitably introduces a bias in the quality of songs in the dataset, as well as in the prevalence of certain themes. Unfortunately, we see no way around this limitation.

Our focus on English is a clear limitation of this study, and so future work should look for trends and peculiarities in other languages. Cultural differences might emerge in the way users with different linguistic and national backgrounds engage with AI music generation. An interesting aspect to consider is how the distribution of languages on the platforms does not seem to follow the distribution of the most commonly spoken languages in the world. This might be due to asymmetries in the way the services were released and advertised across the world, as well as economic factors and internet access.

While the clustering methodology we employ is extremely effective at extracting information from such a large and heterogeneous collection, it ends up labeling many points as outliers. Manually inspecting these outliers, however, shows that there are undetected groups that could still be recovered by introducing a classifier trained on data that has been labeled and validated by humans. Future work might consider a multi‑step procedure that involves active learning to have total coverage. Additionally, this would allow continuous addition of new songs to the dataset and automatic labeling of them. While, in our work, we named clusters using a coarse procedure that required some manual refinement, a more sophisticated pipeline could be employed, possibly including large language models for naming the clusters.

While performing this research, we often found ourselves at methodological crossroads. On the one hand, we want to capture the most representative user behaviors and song themes in our dataset, giving special attention to establishing well‑defined clusters and analyzing the major ones. On the other hand, outliers stick out from the bulk of the data and contain interesting behaviors that can reveal something about the more nuanced inner workings of the platforms and the environment around them. We try to strike a balance in this article, but future work might focus on specific thematic or linguistic subgroups identifiable with our methodology.

Ethnographic studies can be carried out, targeting specific user groups and their unique practices. This kind of work is beginning to appear, e.g., Deruty et al. (2022), Herington et al. (2025), Kehagia and Moriaty (2023), and Torres et al. (2025). AI music communities can also be studied by integrating relevant data from Discord or Reddit into the analysis. General discussions in those forums might provide insight on how AI music is received and what strategies and practices are applied in the process and are considered the best. For example, studying challenges organized by community members can provide multiple songs with a common thematic baseline to be analyzed, helping highlight differences.

8 Conclusion

This article is the first to study textual metadata from a large dataset of AI music collected from two platforms offering text‑to‑music models: Suno and Udio. The dataset we collect is the first of its kind and allows us to look at the growing phenomenon of AI music. We draw from existing literature on AI image generation, and we adapt an established methodology to study prompt‑based systems. We focus on prompts, tags, and lyrics, as they are the main interface to these music‑generation systems and provide insight into the users of the platforms and how they use them. Our analyses identify clusters of themes in the data and highlight behaviors and trends in the prompts and tags of the users. Due to uncertainty with regard to intellectual property rights, we only make available the list of URLs with which one can retrieve the data. Our novel methodology and data can be used by other researchers in the field who are interested in furthering this exploration. We believe that a number of relevant downstream tasks can be based on our data and methods. Ethnographic studies might target subgroups in the user base and study specific aspects that we only touched upon, like non‑English prompts and lyrics, while more technical research can benchmark new MIR pipelines on the large dataset. We plan on extending this work by leveraging both audio and textual features and their interplay to support further analyses.

9 Reproducibility

We share a list of URLs to the songs we collected from both services but, owing to legal concerns, we do not publish our dataset with the song metadata or the code to scrape the playlists directly. The code to fetch the data from the URLs and to produce our visualizations and analyses can be found at the following repository: https://github.com/mister-magpie/aims_prompts. The availability of songs is subject to change, as users can decide to delete their songs or accounts. At the time of submission, we observed that $4.59\%$ and $1.02\%$ of URLs from Udio and Suno, respectively, are not reachable.

10 Ethics and Consent

As AI‑generated media and the services used to produce it become more widespread, it is important to study it and its impact (Sturm et al., 2024). Since this technology exists in the public sphere, it should be studied in this real context with real users. Our research questions revolve around users and their use, and so we focus on observing AI music as users have created and published it. Our collection of data from the Suno and Udio websites is not permitted by their terms of service, but this does not mean this research is illegal or unethical. As we are a publicly funded research group focused on understanding the nature of AI music and its broader implications, we are permitted to collect such data and study it under the Copyright Directive (EU) 2019/790 (European Union, 2019). We attempted to contact both companies upon data collection but did not receive a response.

Acknowledgments

This paper is an outcome of a project that received funding from the European Research Council (ERC) under the European Union’s ‘Seventh Framework Programme (FP7/2007‑2013)’ or ‘Horizon 2020 research and innovation programme’ (MUSAiC, Grant Agreement No. 864189). A.K. Kaila is funded by the Wallenberg AI, Autonomous Systems and Software Program – Humanities and Society (WASP‑HS), funded by the Marianne and Marcus Wallenberg Foundation and the Swedish Research Council (2024‑01832).

Contributions

L. Casini designed and performed all the analyses in the paper and produced most of the text in the manuscript. L. Cros Vila collected the original dataset and provided feedback during data analysis and manuscript writing. A. K. Kaila and D. Dalmazzo provided help with the ideation of the article and feedback on the manuscript. B. L. T. Sturm supervised the work, providing feedback at every step and actively contributing to editing the manuscript.

Competing Interests

The authors have no competing interests to declare.

Notes

[1] As of December 16, 2025, Suno’s Discord has $405,210$ members and Udio’s Discord has $17,818$.

[2] https://github.com/mister-magpie/aims_prompts.

[3] An interactive version is available online at https://mister-magpie.github.io/aims_prompts/.

[4] https://huggingface.co/facebook/fasttext-language-identification.

[5] https://en.wikipedia.org/wiki/ISO_639:y.

[6] https://huggingface.co/nvidia/NV-Embed-v2.

[7] https://umap-learn.readthedocs.io/.

[8] https://scikit-learn.org.

[9] e.g., https://sunoprompt.com, last accessed March 12, 2025.

[10] https://mister-magpie.github.io/aims_prompts/.

[11] https://discord.com/channels/1069381916492562582/1231954456007151736/1253680890593280062, last accessed March 12, 2025.

[12] https://everynoise.com, last accessed March 12, 2025.

[13] https://www.imit.org.uk/pages/a-to-z-of-musical-instrument.html, last accessed March 12, 2025.