1 Introduction
Advances in audio processing have enabled new ways to study musical trends at scale using quantitative methods. Although scholars have analyzed music for millennia, large‑scale analyses have only become feasible with the recent availability of large datasets (Bertin‑Mahieux et al., 2011; Serrà et al., 2012; Shalit et al., 2013). Improvements in source separation technology now allow researchers to isolate individual instruments from full mixes (Rouard et al., 2023; Stoller et al., 2018). Despite these developments, vocal lines, often the most prominent and expressive element of a song, remain underexplored in music research (Bürgel et al., 2021; Demetriou et al., 2018).
In this study, we analyze vocal microtiming over time and between genres in a corpus of 88,357 songs spanning 45 years (1965–2010). Using state‑of‑the‑art source separation, we isolate vocal stems from 30–60‑s excerpts of each track. Together, these excerpts total more than 30 days of continuous audio. Along with this article, we release our list of MSD IDs and microtiming characteristics.1
This analysis builds on our prior corpus study examining pitch characteristics of vocals in the same dataset (Georgieva et al., 2024). In that work, we found that musical genres often differ significantly in mean pitch, total variation (the rate of pitch change), and pitch class entropy (the degree of unpredictability for the set of pitches). Over time, mean pitch has increased, while both total variation and pitch class entropy have shown significant decreases. If these two metrics are interpreted as indicators of musical complexity, the trend suggests that vocal performances—in our dataset, at least—have become less complex over time. The current study shifts focus from pitch to vocal timing, investigating microtiming patterns across decades and genres in this large and diverse corpus.
2 Related Work
In recent years, the transformation of music over time has emerged as a topic of growing interest—in part due to the release of open‑source resources like the Million Song Dataset (MSD) (Bertin‑Mahieux et al., 2011). The MSD provides audio features and metadata for one million contemporary tracks, enabling researchers to analyze musical patterns at scale using quantitative methods.
Serrà et al. leveraged MSD features to define musical ‘codewords’ and track changes in pitch, timbre, and loudness over time (Serrà et al., 2012). Their findings revealed that more recent songs tend to have less variety in pitch transitions, more uniform timbres, and increased loudness. Similarly, Parmer and Ahn (2019) used the MSD to explore musical complexity from 1960 to 2010, observing stable pitch complexity, declining loudness and rhythmic complexity, and rising timbral complexity. Parmer and Ahn also examined the Billboard Hot 100 chart,2 finding that the complexity of popular songs tends to cluster around the average for all tracks, a result that supports the inverted U‑shaped model of musical complexity and listener preference (Cheung et al., 2019; Gold et al., 2019; Matthews et al., 2023; Witek et al., 2023).
Other researchers have used the MSD to model musical influence, investigating how particular artists shape the work of others (Shalit et al., 2013). By identifying genre‑representative song clusters, they tracked the evolution of these groups over time, and found that, while musical influence and innovation are not consistently correlated, the most influential songs exhibited higher levels of innovation during two periods: the early 1970s and the mid‑1990s.
A separate study analyzed 17,000 Billboard‑charting songs to examine the ‘evolution of popular music’ in the United States between 1960 and 2010 (Mauch et al., 2015), uncovering three major stylistic shifts occurring in 1964, 1983, and 1991, respectively, based on harmonic and timbral features. Hamilton and Pearce (2024) analyzed US chart‑topping songs from 1950 to 2023 and identified two major ‘melodic revolutions’ in 1975 and 2000, along with a smaller shift in 1996. These revolutions marked significant declines in melodic complexity and increases in note density, suggesting a broader trend toward expressing musical ideas through rhythm and timbre rather than melody (Hamilton and Pearce, 2024). A third recent popular music corpus study introduces the MGPHot dataset, a large‑scale, expert‑annotated dataset of 21,000 Billboard Hot 100 songs (1958–2022) featuring musical attributes such as rhythm, harmony, vocals, and more (Oramas et al., 2025). They identify major stylistic shifts in US popular music and pinpoint musical revolutions (in 1964, 1983, and 2016, plus minor ones in 1991 and 2007), matching the findings of Mauch et al. (2015).
Weiß and Müller (2015) developed a set of tonal complexity features to classify classical music pieces by style, extracting measures such as pitch class distributions and pitch interval patterns to quantify the harmonic and melodic structure. In a related study, Weiß et al. (2018) analyzed jazz solos, quantifying note‑level complexity, phrase structure, and improvisational patterns across a large dataset of recorded performances, showing how computational metrics can reveal systematic stylistic trends in jazz improvisation.
In addition to large‑scale analyses, some researchers have focused on narrower topics over extended periods—for example, tracking the evolution of a single band's performances (Thalmann et al., 2022), examining trends in dynamic range and compression in mainstream music (Deruty and Pachet, 2015), or analyzing changes in spectral content across decades (Pestana et al., 2013).
Specific to singing voice, Alan Lomax's 1959 Cantometrics project analyzed over 4,000 traditional vocal music songs from 400 cultures (Oehrle, 1992). Researchers listened to songs and labeled them with 37 ‘style factors,’ such as ‘group cohesion’ in singing and a tense or relaxed vocal quality. The Cantometrics project suggested a correlation between song style and social norms of cultures. A 2009 paper takes the Cantometrics data and proposes a machine‑learning approach to automatically classify singing ensembles as a cappella or accompanied by instruments (Proutskova and Casey, 2009).
In a more recent study, researchers developed a set of features to capture pitch and melodic embellishments of world vocal performances (Panteli et al., 2017). Using these features, they trained a classifier to distinguish vocal from nonvocal segments and learn a dictionary of singing style elements. A different study categorized a collection of 360 Dutch folk songs and found that the aspects of melody that are important for establishing similarity are contour, rhythm, and motifs (Volk and van Kranenburg, 2012). Another study analyzed over 353,000 English‑language song lyrics from 1970 to 2020 across five major genres, revealing that lyrics have become increasingly simple and repetitive over time (Parada‑Cabaleiro et al., 2024). Finally, a recent study analyzed pitch trends in over 145,000 vocal performances and found systematic genre‑ and decade‑based differences in mean vocal pitch, total variation, and pitch class entropy, suggesting shifts in vocal delivery over time and between genres (Georgieva et al., 2024).
An intriguing area of music research focuses on microtiming—that is, the subtle, expressive deviations in the timing of musical onsets relative to the underlying metrical grid (Frühauf et al., 2013). Listeners experience rhythm not only through the physical timing of sounds but also through internal reference structures—such as meter, pulse, and subdivision—that they project onto the music, creating a perceived sense of timing through the interaction of these layers (Danielsen, 2006).
Previous research has investigated microtiming to characterize the ‘feel’ of genres with complex microtiming patterns such as Uruguayan Candombe drumming and Brazilian Samba (Fuentes et al., 2019; Jure and Rocamora, 2016). An analysis of Malian jembé ensemble performances showed extremely tight onset‑timing distributions across many players and takes (London et al., 2022). The work also documents characteristic nonisochronous beat–subdivision ratios in pieces, suggesting that microtiming in these performances is a stable component of the rhythmic style (Polak, 2010). Other studies have explored how microtiming affects the listener experience, particularly in groove‑based styles like swing and funk. Senn et al. (2016) and Kilchenmann and Senn (2015) found that expert listeners were more sensitive to timing manipulations than nonexperts and that microtiming variations were neither strictly necessary nor detrimental to groove perception.
Danielsen et al. also examined timing and sound in five groove‑based genres—jazz, samba, electronic dance music, hip‑hop, and traditional Scandinavian fiddle music (Danielsen et al., 2023). Through semistructured interviews with musicians and producers, this study revealed that each genre draws on a unique blend of timing, attack shape, timbre, dynamics, and articulation to shape its characteristic sense of groove. In another work, Danielsen et al. demonstrated that microtiming perception is shaped not only by temporal placement but also by acoustic features such as attack, timbre, and duration. Their findings suggest that what listeners perceive as timing is deeply entangled with other sonic qualities of music (Danielsen et al., 2024). Specific to vocal performance recordings, one study presents a score‑informed system for automatically deriving note‑level performance information from monophonic recordings of singers (Devaney et al., 2011).
Brøvig and Danielsen examined microtiming in a few selected songs in their book The Impact of Digitization on Popular Music Sound (Brøvig and Danielsen, 2016). For example, they illustrated microtiming in Snoop Dogg's 2004 song ‘Can I Get a Flicc Witchu,’ showing how the rap vocal aligns with the irregular timing of the bass riff rather than the underlying 4/4 meter. They showed that syllables in the verse often fall late relative to the eighth‑note grid. Two other researchers investigated alignment and timing in rap lyrics and flow (Kautny, 2015; Ohriner, 2019). Ohriner (2019) analyzed expressive timing in hip‑hop flow, showing how small timing deviations serve distinct rhetorical and structural functions within performances by artists like Kendrick Lamar and Eminem. Kautny (2015) explored how rhythmic flow and lyrical delivery interact in rap music. While interesting, this previous research has focused on a handful of songs or artists and is limited in scope. The specific study of microtiming in sung vocals thus remains limited. Here, we use large‑scale computational tools to fill this gap in the literature (Kautny, 2015). Understanding, on a large scale, how singers manipulate timing can shed light on the expressive performance choices that shape a song's overall feel and emotional impact (including aspects such as musical groove). Quantifying these timing patterns deepens our understanding of popular music's stylistic evolution and historical development.
3 Dataset
For our microtiming analysis, we used a subset of 278,619 tracks from the MSD (Bertin‑Mahieux et al., 2011) that had genre labels available in the Tagtraum annotations (Schreiber, 2015). The genre labels were the result of majority voting, pulling from the Top‑MAGD (All Music Guide), LFMGD (Last.fm), and BGD (beaTunes) datasets. We excluded a group of songs due to a low presence of vocals in the excerpt, indicated by a low ratio of root mean square (RMS) energy of the separated vocal stem to RMS energy of the full audio file (see Section 4.1). Following prior work on vocal analysis on the MSD, we used a threshold of 0.08 to exclude estimated nonvocal clips (Georgieva et al., 2024). Songs that did not have the release year available were also dropped. We chose to conduct analyses only starting in the year 1965, as data from before 1965 were sparse.
Finally, we also dropped songs that were not estimated as duple meter in at least 90% of the track (see Section 4.2.2). We chose to study only songs with a duple metrical structure (e.g., 4/4, 2/4, 2/2) because microtiming results aggregate more cleanly across songs with the same time structure, and there are not sufficiently accurate meter estimators to support analysis of songs in odd meter (Böck et al., 2016; Holzapfel et al., 2014). We tested several state‑of‑the‑art downbeat tracking and meter‑estimation models on odd‑meter data and found that their performance was insufficient for reliable analysis. A total of 93,640 (64%) excerpts were estimated to be in duple meter in at least 90% of the track (see Section 4.2.2). We conducted the analysis only on these tracks. Based on these results, we expect very few nonduple tracks to be part of our final dataset.
We obtained reliable onset alignment in 89,003 of the 93,640 duple meter tracks. In 4,637 tracks, the onsets were sparse and, after mapping onsets to the corresponding subdivision, did not align closely enough to the metrical grid subdivisions to provide opportunity for meaningful analysis (see Section 4.2.4). We also dropped songs with fewer than 10 total estimated vocal onsets. Our final song count is 88,357 songs. Furthermore, in the by‑genre analyses, we only look at the 10 most‑represented genres, illustrated in Figure 1 (top).

Figure 1
Distribution of genres (top) and years (bottom) in the duple‑meter‑only sub‑dataset used for the microtiming analysis.
3.1 Limitations
The dataset consists of 30–60‑s excerpts of songs from the MSD, taken from a private collection of preview clips pulled from 7digital.3 These excerpts typically contain a section in the middle of the song (Schindler et al., 2012). The selection of excerpts may vary by genre—for example, Rap tracks may include spoken or sung segments, potentially introducing genre‑specific bias in which parts of a song are analyzed.
Our dataset also inherits several biases from the MSD. Tracks in the MSD were selected based in part on their association with ‘familiar’ artists, as defined by The Echo Nest, followed by the inclusion of songs from similar artists.4 The dataset also prioritizes artists matching the 200 most frequently used Echo Nest descriptive terms, as well as songs with extreme acoustic characteristics.
As a result, the dataset primarily contains widely‑consumed music, with most tracks originating from North America or Europe. The Rock genre is the most represented in the dataset. Coverage is skewed toward more recent decades (especially post‑1990), and the representation of non‑Western and classical music is minimal. While the Latin genre includes some songs in Spanish or Portuguese, overall, language diversity is limited. Our findings reflect the content and scope of this dataset and cannot be generalized to all music.
4 Method
4.1 Source separation
First, we used source separation to separate the vocal line of each song from the mix. We use the isolated vocal line as it allows for a more detailed analysis of the vocal performance. For source separation, we used Hybrid Transformer Demucs (HT Demucs), a hybrid temporal/spectral bi‑U‑Net (Rouard et al., 2023). After computing the ratio of the vocal stem's RMS energy to the overall mix's RMS energy, we excluded any songs with a ratio below 0.08 (Figure 2). This ratio was set using a preliminary sub‑sample of the data. The excerpts with a ratio below 0.08 are either purely instrumental songs (nonvocal) or the clip happens to capture a part of the audio file with very few or no vocals (e.g., a guitar solo) (Georgieva et al., 2024). In this paper, microtiming analysis is conducted on the source‑separated vocal stems. We use the mixes to estimate meter, beats, and downbeats.
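The vocal-presence filter described above can be sketched in a few lines. This is a minimal illustration, assuming the vocal stem and the full mix are available as equal-length sequences of samples; the function names `rms`, `vocal_ratio`, and `has_vocals` are hypothetical and are not taken from the authors' code. Only the 0.08 threshold comes from the text.

```python
import math

RMS_RATIO_THRESHOLD = 0.08  # threshold reported in the text


def rms(samples):
    """Root mean square of a sequence of audio samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def vocal_ratio(vocal_stem, full_mix):
    """Ratio of vocal-stem RMS energy to full-mix RMS energy."""
    mix_rms = rms(full_mix)
    return rms(vocal_stem) / mix_rms if mix_rms > 0 else 0.0


def has_vocals(vocal_stem, full_mix, threshold=RMS_RATIO_THRESHOLD):
    """Keep an excerpt only if its ratio is not below the threshold."""
    return vocal_ratio(vocal_stem, full_mix) >= threshold


# Toy signals (invented values, for illustration only).
mix = [1.0, -1.0, 1.0, -1.0]
loud_vocal = [0.5, -0.5, 0.5, -0.5]
faint_vocal = [0.01, -0.01, 0.01, -0.01]
print(has_vocals(loud_vocal, mix), has_vocals(faint_vocal, mix))
```

In practice the samples would come from the separated stem and the original excerpt, loaded at the same sample rate.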

Figure 2
Distribution of the ratios of the vocal stem root mean square (RMS) energy to the full mix RMS energy in the data. A threshold of 0.08 was used to discard nonvocal clips.
4.2 Microtiming
Next, we study microtiming, that is, the small‑scale temporal deviations of vocal onsets with respect to the underlying metrical grid. To conduct microtiming analyses, we performed onset estimation, meter estimation, beat tracking, and downbeat tracking on the dataset. We filtered to select only songs with a duple metrical structure, and we fed this information to the carat (‘Computer‑Aided Rhythm Analysis Toolbox’) Microtiming Toolbox to quantify any microtiming in the vocal lines of songs (Rocamora and Jure, 2019). Of note, carat is one of the few tools that supports onset‑level analysis relative to metrical grids, and it is tailored for microtiming analysis.
4.2.1 Onset estimation
To study onset characteristics, we evaluated onset detection accuracy on a small subset of the MSD, consisting of 90 10‑s vocal stems that the first author manually annotated. The selected songs cover all 10 genres and seven decades, with 6–20 songs from each decade, with more songs from the decades that are more represented in the dataset. We selected between five and seven songs from each genre, except for Rock, by far the most represented genre, for which we included 20 songs. We release this dataset of vocal onset annotations as an open‑source resource.5
We used the SuperFlux (Böck and Widmer, 2013) method for onset detection, as implemented in Librosa v0.8.1 (McFee et al., 2021). SuperFlux is designed to improve onset detection accuracy in the presence of vibrato and is well‑suited for onset detection in singing voice.
First, we computed a magnitude Mel spectrogram using an FFT window size of 2048 samples and a hop length of 220 samples. We used 69 Mel bands spanning a frequency range of 100–8000 Hz and applied the fully logarithmic HTK mapping. We set the power parameter to 1.0 so that the spectrogram represented the magnitude of the signal, and we then compressed the spectrogram. Next, we tuned the onset strength envelope function, auditioning several options for lag and max size, eventually settling on 4 and 3, respectively.
After we calculated the onset envelope, we evaluated several options for a low‑pass filter cutoff and order to smooth out small peaks in the onset envelope. This step was especially valuable given that our source‑separated stems occasionally contained noise that triggered onsets in unvoiced segments. We ultimately settled on a fourth‑order Butterworth filter with a cutoff of 8 Hz, implemented in SciPy (Virtanen et al., 2020) version 1.9.3. Finally, we experimentally selected 0.1 as our delta value for the onset detection function and chose a symmetric window in which to select local maxima (pre_max = post_max = 30 ms).
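To make the final peak‑selection step concrete, the sketch below converts the 30 ms window to frames and picks local maxima that exceed the local mean by delta. It is a simplified stand‑in, not Librosa's peak picker, which applies additional averaging and wait parameters. The sample rate of 22,050 Hz is an assumption (it is not stated in the text), and `ms_to_frames` and `pick_peaks` are hypothetical helper names.

```python
ASSUMED_SAMPLE_RATE = 22050  # assumption: not stated in the text
HOP_LENGTH = 220             # hop length reported in the text
DELTA = 0.1                  # delta reported in the text


def ms_to_frames(milliseconds, sr=ASSUMED_SAMPLE_RATE, hop=HOP_LENGTH):
    """Convert a duration in milliseconds to a whole number of frames."""
    return round(milliseconds / 1000 * sr / hop)


def pick_peaks(envelope, half_window, delta=DELTA):
    """Indices that are the maximum of a symmetric window around them
    and that exceed the mean of that window by at least `delta`."""
    peaks = []
    n = len(envelope)
    for i, value in enumerate(envelope):
        lo = max(0, i - half_window)
        hi = min(n, i + half_window + 1)
        window = envelope[lo:hi]
        is_local_max = value == max(window)
        above_mean = value >= sum(window) / len(window) + delta
        if is_local_max and above_mean:
            peaks.append(i)
    return peaks


# Toy onset envelope (invented values, for illustration only).
envelope = [0, 0, 0.1, 0.9, 0.2, 0, 0, 0.05, 0, 0, 0.6, 0.1, 0]
half_window = ms_to_frames(30)
peak_frames = pick_peaks(envelope, half_window)
peak_times = [f * HOP_LENGTH / ASSUMED_SAMPLE_RATE for f in peak_frames]
print(half_window, peak_frames, peak_times)
```

The small bump at index 7 of the toy envelope is rejected because a larger value lies inside its window, which is the behavior the symmetric window is meant to provide.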
In our validation experiment on the annotated 90 10‑s excerpts, we used mir_eval with a default evaluation window of 50 ms, achieving an F‑measure of 0.742, precision of 0.750, and recall of 0.765 (Raffel et al., 2014). Occasional errors come from the false alarm detection of breaths/inhales, non‑onset consonants, and noise coming through in unvoiced regions of the source‑separated stems, as well as occasional disagreement between the human annotator and the onset detector (see Section 4.2.4).
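The evaluation logic can be illustrated with a small sketch: an estimated onset counts as a hit if it falls within 50 ms of a reference onset that has not already been matched. The greedy two‑pointer matching below is a simplification written for this illustration; it is not the matching routine used by mir_eval, and the function name `onset_scores` and the onset times are invented.

```python
def onset_scores(reference, estimated, window=0.05):
    """Return (precision, recall, F-measure) for two lists of onset
    times in seconds, matching each onset at most once."""
    ref = sorted(reference)
    est = sorted(estimated)
    i = j = matches = 0
    while i < len(ref) and j < len(est):
        if abs(ref[i] - est[j]) <= window:
            matches += 1          # hit: consume both onsets
            i += 1
            j += 1
        elif est[j] < ref[i]:
            j += 1                # false alarm: skip the estimate
        else:
            i += 1                # miss: skip the reference onset
    precision = matches / len(est) if est else 0.0
    recall = matches / len(ref) if ref else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f_measure


# Toy onset times in seconds (invented values).
annotated = [0.50, 1.00, 1.50, 2.00]
detected = [0.52, 1.08, 1.49, 2.50]
print(onset_scores(annotated, detected))
```

Here the estimate at 1.08 s is a false alarm (80 ms from the nearest annotation) and the annotation at 2.00 s is a miss, so two of four onsets match on each side.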
4.2.2 Meter estimation
We chose to analyze only songs with a duple metrical structure (e.g., 2/4, 2/2, 4/4), which we analyzed using four subdivisions per bar. In order to estimate meter, we used a probabilistic approach for extracting time signature information (Cozens and Godsill, 2024). Each track is labeled with the most common meter detected across its duration.
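The track‑level labeling and the 90% duple criterion from Section 3 can be sketched as follows. This assumes the meter estimator yields one meter label per analysis segment of a track, which is an assumption about its output format; the function names and the example labels are hypothetical.

```python
from collections import Counter

DUPLE_METERS = {"2/4", "2/2", "4/4"}  # duple meters named in the text


def track_meter(segment_labels):
    """Most common meter label across a track's segments."""
    return Counter(segment_labels).most_common(1)[0][0]


def duple_fraction(segment_labels):
    """Fraction of segments whose estimated meter is duple."""
    duple = sum(label in DUPLE_METERS for label in segment_labels)
    return duple / len(segment_labels)


def keep_track(segment_labels, min_fraction=0.9):
    """Apply the 'duple in at least 90% of the track' criterion."""
    return duple_fraction(segment_labels) >= min_fraction


# Invented per-segment estimates for two excerpts.
mostly_duple = ["4/4"] * 9 + ["3/4"]
mixed = ["4/4"] * 8 + ["3/4"] * 2
print(track_meter(mostly_duple), keep_track(mostly_duple), keep_track(mixed))
```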
We evaluated this method on a dataset consisting of the 222 tracks from the Hainsworth dataset, mostly in 4/4 time (Hainsworth and Macleod, 2004), and 32 tracks from the McGill Billboard dataset (Burgoyne et al., 2011), which were selected to be in one of the following meters: 3/4, 5/4, 6/8, 7/8, or 12/8. On these 254 tracks, our method achieved a precision of 0.967, recall of 0.70, and F‑measure of 0.812. Here, the positive class is 4/4 and the negative class is all other meters. The high precision score indicates that almost all of the songs we estimated as 4/4 were indeed 4/4. In this context, we prioritized avoiding false positives—incorrectly labeling a non‑4/4 song as 4/4 would undermine the reliability of our analyses.
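As a quick consistency check, the reported F‑measure follows from the reported precision and recall through the harmonic mean. The short sketch below reproduces that arithmetic; the function name `f_measure` is chosen here for illustration.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall for the 4/4 class, as reported in the text.
meter_f = f_measure(0.967, 0.70)
print(round(meter_f, 3))
```

The recall of 0.70 means that some true 4/4 tracks are discarded, which is the cost of the conservative, high‑precision setting described above.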
4.2.3 Beat and downbeat tracking
We used the Beat This! beat and downbeat tracker (Foscarin et al., 2024). To evaluate the tool, we ran beat and downbeat tracking on the 222 Hainsworth tracks. For beats, we report an F‑measure of 0.975, and for downbeats, one of 0.920. These are evaluations for the whole dataset, of which only 19 tracks were reported to be not in 4/4. With those non‑4/4 tracks dropped, we report F‑measures of 0.980 and 0.920 for beats and downbeats, respectively.
4.2.4 Microtiming analysis
To study microtiming, we used the carat microtiming library (Rocamora and Jure, 2019). Microtiming analysis investigates musical events (in this case, vocal onsets) that exhibit small temporal deviations with respect to the underlying metrical grid (Davies et al., 2013). Carat allows us to aggregate findings among many tracks.
We mapped each vocal onset to the nearest metrical subdivision and computed its timing deviation from that point, using a tolerance of 1/8. Therefore, each vocal onset was rounded either to 0.0, 0.25, 0.5, or 0.75—equal subdivisions of the beat, as estimated with the beat tracker. Using carat, we fit a normal distribution to the onsets in each subdivision and computed the mean and standard deviation values. The means often fell close to the values of 0.0, 0.25, 0.5, and 0.75, but sometimes early or late, in which case we refer to the effect as microtiming.
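The mapping step can be sketched as follows for beat subdivisions. This is an illustration of the idea only, not carat's implementation: it assumes sorted beat times in seconds, expresses each onset as a fraction of its enclosing beat, snaps it to the nearest quarter of the beat, and then summarizes each subdivision with the mean and standard deviation of a fitted normal distribution. The function names and toy values are hypothetical.

```python
from bisect import bisect_right
from statistics import mean, pstdev

GRID = (0.0, 0.25, 0.5, 0.75)  # N, E, +, A as fractions of a beat
TOLERANCE = 1 / 8              # tolerance reported in the text


def onset_deviations(onsets, beats):
    """Map each onset to its nearest subdivision and collect the
    signed deviation (positive = late) in beat-normalized units."""
    deviations = {g: [] for g in GRID}
    for t in onsets:
        k = bisect_right(beats, t) - 1
        if k < 0 or k >= len(beats) - 1:
            continue  # onset lies outside the tracked beats
        phase = (t - beats[k]) / (beats[k + 1] - beats[k])
        idx = round(phase * 4)   # 0..4; 4 is the next beat's N
        dev = phase - idx / 4
        if abs(dev) <= TOLERANCE:
            deviations[GRID[idx % 4]].append(dev)
    return deviations


def summarize(deviations):
    """Grid position plus mean deviation, and spread, per subdivision."""
    return {
        g: (g + mean(d), pstdev(d)) for g, d in deviations.items() if d
    }


# Toy beat and onset times in seconds (invented values).
beats = [0.0, 0.5, 1.0, 1.5]
onsets = [0.01, 0.135, 0.24, 0.49, 0.76]
devs = onset_deviations(onsets, beats)
print(summarize(devs))
```

An onset just before a beat (0.49 s in the toy data) is assigned to the N position of the following beat with a negative deviation, so early and late onsets around the same grid point are pooled together.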
We performed this analysis first for beats, studying the divisions of the musical bar. We call the beats beat 1, beat 2, beat 3, and beat 4. We also conducted the analysis for subdivisions of beats, which we call N, E, +, A (e.g., one might count 16th notes as ‘1 e and a, 2 e and a’). An example of a carat output for one song's beats and beat subdivisions is illustrated in Figure 3.

Figure 3
Visualizing one 30‑s song excerpt using the carat microtiming tool. Top: Each dot represents a vocal onset distributed among four equal subdivisions of a bar (1, 2, 3, 4). Bottom: Onsets distributed among four equal subdivisions of a beat (N, E, +, A). The horizontal axis shows the timing deviation (in beat‑normalized time) relative to the mean onset location for each metrical position. In the vertical axis, each row corresponds to an individual onset event from the excerpt, allowing all onsets to be displayed simultaneously. At the bottom, a histogram of the onset locations is depicted. We fit a normal distribution to the onsets in each subdivision; the mean is shown as a vertical dotted line and as a percent, and the standard deviation is shown as horizontal ‘arms.’ The straight‑timed (left) plot is an illustration of MSD track ID: TRUUJZA128F931525C, Mr. Brown by Styles of Beyond.6 The deviated example (right) illustrates MSD track ID: TRIOFIX128F93156AD, Sunndal Song by The Apples in Stereo.7
To illustrate the reliability of the onset and beat‑estimation methods, we include two visual examples of short audio excerpts (see Figure 4). In each example, the top row displays a vocal spectrogram with estimated vocal onsets overlaid and the bottom row shows the corresponding full mix spectrogram with annotated beats and downbeats. The left column features a performance with relatively straight timing, while the right shows one with microtiming deviations.

Figure 4
Examples of vocal and mix spectrograms with overlaid estimated onset, beat, and downbeat annotations, comparing a straight‑timed performance (left) with one showing microtiming deviations (right). The straight‑timed (left) plot is an illustration of MSD track ID: TRUUJZA128F931525C, Mr. Brown by Styles of Beyond.6 The deviated example (right) illustrates MSD track ID: TRIOFIX128F93156AD, Sunndal Song by The Apples in Stereo.7
To assess whether inaccuracies in the onset detector were correlated with metrical position, potentially biasing our analyses, we conducted a series of two‑sample Kolmogorov–Smirnov (KS) tests comparing estimated vocal onsets to annotated vocal onsets. We ran the carat microtiming tool separately on estimated onsets and annotated onsets from our 90 onset‑annotated vocal stems (see Section 4.2.1). We compared onset locations to beats and beat subdivisions; beats were estimated using Beat This! on the mix tracks (see Section 4.2.3).
We performed KS tests for each of the four beats, then for each of the four beat subdivisions. KS tests were implemented using the ks_2samp function from SciPy (Virtanen et al., 2020). To account for multiple comparisons, we applied a Bonferroni correction (corrected p < 0.00625, accounting for eight tests). For the four beats, we did not observe any statistically significant differences: beat 1 (KS = 0.097, p = 0.787, n = 78 vs. 86), beat 2 (KS = 0.087, p = 0.909, n = 72 vs. 77), beat 3 (KS = 0.109, p = 0.749, n = 66 vs. 73), and beat 4 (KS = 0.121, p = 0.605, n = 69 vs. 76). Among beat subdivisions, three of the four subdivisions yielded nonsignificant results: beat subdivision N (KS = 0.175, p = 0.141, n = 78 vs. 86), E (KS = 0.144, p = 0.557, n = 48 vs. 66), and A (KS = 0.103, p = 0.834, n = 60 vs. 75). Only the + subdivision showed a statistically significant difference after Bonferroni correction (KS = 0.280, p = 0.0038, n = 74 vs. 79).
This difference suggests that estimation errors may occur more frequently at the + subdivision. In the majority of cases, we failed to reject the null hypothesis, indicating that estimated vocal onsets generally align well with annotated onsets.
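The Bonferroni decision can be reproduced directly from the reported p‑values. The sketch below only restates that arithmetic (a family‑wise alpha of 0.05 divided across eight tests); it does not rerun the KS tests, and the dictionary keys are labels chosen for this illustration.

```python
ALPHA = 0.05  # family-wise significance level

# p-values as reported in the text for the eight KS tests.
p_values = {
    "beat 1": 0.787,
    "beat 2": 0.909,
    "beat 3": 0.749,
    "beat 4": 0.605,
    "N": 0.141,
    "E": 0.557,
    "+": 0.0038,
    "A": 0.834,
}

# Bonferroni: divide alpha by the number of comparisons.
threshold = ALPHA / len(p_values)
significant = [name for name, p in p_values.items() if p < threshold]
print(threshold, significant)
```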
4.3 Statistical analyses
We used R (4.2.2; R Foundation for Statistical Computing, Vienna, Austria) and RStudio (2022.12.0 + 353; Posit, PBC, Boston, MA, USA) to implement linear regression with the lm function. Post‑hoc tests were implemented using the emmeans package with Tukey correction for multiple comparisons.
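For readers who do not use R, the simple regression of a timing measure on release year can be sketched in plain Python. This is an ordinary least squares fit with one predictor, returning the slope and its t statistic; it is an illustrative equivalent written for this purpose, not the authors' code, and the data values below are invented.

```python
import math


def ols_slope_t(x, y):
    """Fit y = intercept + slope * x by least squares and return
    (slope, intercept, t statistic of the slope)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    residual_ss = sum(
        (yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y)
    )
    # Standard error of the slope with n - 2 residual degrees of freedom.
    slope_se = math.sqrt(residual_ss / (n - 2) / sxx)
    return slope, intercept, slope / slope_se


# Invented toy data: release year vs. mean onset location for one beat.
years = [1965, 1975, 1985, 1995, 2005]
onset_location = [0.480, 0.483, 0.481, 0.486, 0.487]
slope, intercept, t_stat = ols_slope_t(years, onset_location)
print(slope, t_stat)
```

A positive slope in such a fit corresponds to onsets arriving later in more recent tracks.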
5 Results and Analysis
For beats and beat subdivisions, we followed the same procedure. We first ran a linear regression to examine the relationship between the variable of interest (e.g., beat 1) and musical genre (e.g., beat 1~genre). We then calculated a linear regression between the variable of interest and the year of track release (e.g., beat 1~year). Finally, we calculated independent linear regressions between the variable of interest and year of track release for the 10 most frequently occurring genres. We calculated independent regressions because each of the genres becomes prevalent in the dataset during different years (e.g., Rock music becomes prevalent in 1965, Rap becomes prevalent in 1984). We define the year in which a genre becomes prevalent as the first year of the earliest 5‑year period in which the dataset contains at least five songs from that genre. The 10 genres with corresponding years were as follows: Country (1965), Electronic (1979), Latin (1986), Metal (1980), Pop (1965), Punk (1977), Rap (1984), Reggae (1972), RnB (1965), and Rock (1965).
5.1 Bars
Figure 5 illustrates our findings for the average onset timing with respect to each beat of the metric bar over time in the 10 most‑represented genres in our dataset. For each beat, we observed a significant main effect of genre (beat 1: F(9, 82987) = 293.64, p < 0.001; beat 2: F(9, 82987) = 58.59, p < 0.001; beat 3: F(9, 82987) = 147.16, p < 0.001; beat 4: F(9, 82987) = 19.14, p < 0.001), indicating that the average timing of vocal onsets varies significantly by genre.

Figure 5
Microtiming in beats (i.e., quarter notes) in each of the 10 genres across the dataset. Means are shown with boxes representing the interquartile range, the horizontal line inside the box marks the median, and the whiskers extend to 1.5 times the interquartile range.
These results indicate that musical genres have significantly different vocal microtiming, especially on beats 1, 2, and 3. In Rap music, onsets generally arrive later than those of other genres. In contrast, genres like Pop and Country tend to show slightly earlier placement. These patterns are illustrated in Figure 5. The pairwise Tukey genre comparisons are reported in Supplementary Table S1.
After calculating a linear regression between each beat and the year of track release for the whole dataset, we found a significant relationship between vocal onset microtiming at beat 1 and the release year of a track (β = 0.000021, t = 2.14, p = 0.032). This suggests a subtle tendency for vocal onsets on beat 1 to occur slightly later over time. Similarly, we found a stronger relationship at beat 2 (β = 0.000109, t = 10.68, p < 0.001) and beat 3 (β = 0.000160, t = 15.95, p < 0.001), with mean onset locations becoming increasingly later in more recent tracks. We also observed a positive trend at beat 4 (β = 0.000034, t = 3.57, p < 0.001), though the effect size was small (see Figure 6). The plots show trend lines that lie consistently below the isochronous grid locations (i.e., 0, 0.25, 0.5, 0.75), matching the early microtiming from Figure 5. In the plot, a microtiming difference of ±0.06 is equivalent to approximately one 16th note.
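To give a sense of scale, a slope can be converted into a cumulative shift over the study period and compared with the size of a 16th note. The sketch below does this for the beat 3 coefficient, assuming the predictor is the calendar year and the response is in bar‑normalized units (so that a 16th note is 1/16 of a bar); both are assumptions about how the regression was parameterized, and the variable names are chosen for illustration.

```python
SLOPE_BEAT_3 = 0.000160        # coefficient reported in the text
FIRST_YEAR, LAST_YEAR = 1965, 2010
SIXTEENTH_OF_BAR = 1 / 16      # 0.0625 in bar-normalized units


def cumulative_shift(slope, first_year, last_year):
    """Predicted change in onset location across the whole period,
    assuming the slope is expressed per calendar year."""
    return slope * (last_year - first_year)


shift = cumulative_shift(SLOPE_BEAT_3, FIRST_YEAR, LAST_YEAR)
fraction_of_sixteenth = shift / SIXTEENTH_OF_BAR
print(shift, fraction_of_sixteenth)
```

Under these assumptions, even the largest reported slope corresponds to a shift that is a small fraction of a 16th note across the full period, consistent with the description of these effects as subtle.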

Figure 6
Microtiming of each beat as a function of year globally. Each dot represents a song. The red line represents the predicted slope with 95% confidence intervals. The green diamond and ribbon represent the mean per year and the standard error.
Finally, to account for the fact that different genres become prominent in the dataset in different years, we calculated independent linear regressions between each beat and year of track release for each of the 10 genres. For beat 1, five genres showed a significant main effect of year: Country, Electronic, Pop, Punk, and RnB. Of these, only RnB exhibited a positive trend, indicating a tendency for vocal onsets to occur slightly later over time. The rest showed a slight negative trend.
For beats 2 and 3, five genres showed significant effects: Electronic, Pop, Rap, RnB, and Rock. All of these showed positive trends, indicating increasingly later vocal entries over time, closer to the isochronous metrical grid. For beat 4, there was a significant main effect of year in Country, Rap, and Rock music, with Rap and Rock music showing positive trends and Country showing a negative trend, indicating slightly earlier vocal onsets over time in Country music. For all four beats, the trend observed in Rock, the most prominent genre, matched the global trend. These plots are available in Supplementary Figure S3.
5.2 Beats
In Figure 7, we report the same analysis we performed for beats above, but applied to beat subdivisions: within a beat as opposed to within a bar. Here, each beat is treated as equivalent (i.e., beats 1, 2, 3, and 4 are all simply treated as beats). We refer to the subdivisions of a beat as N, E, +, A, which are 16th notes. In the plot, a microtiming difference of ±0.06 is equivalent to approximately one 64th note.

Figure 7
Microtiming in beat subdivisions (i.e., 16th notes) across 10 musical genres in the dataset, for all four subdivisions. Each box represents the interquartile range (25th–75th percentile), the horizontal line inside the box marks the median, and the whiskers extend to 1.5 times the interquartile range.
For vocal microtiming at the four beat subdivisions across musical genres, we observed a significant main effect of genre (beat subdivision N: F(9, 78445) = 966.38, p < 0.001; beat subdivision E: F(9, 78445) = 45.23, p < 0.001; beat subdivision +: F(9, 78445) = 383.50, p < 0.001; beat subdivision A: F(9, 78445) = 120.94, p < 0.001). This indicates that the average timing of vocal onsets varies significantly by genre.
These results show that musical genres have significantly different vocal microtiming with respect to beat subdivisions, especially for beat subdivisions N and +. Vocals onsets in the RnB and Country genres tended to arrive the latest. The pairwise Tukey genre comparisons are reported in Supplementary Table S2.
Onsets tended to arrive late in general, but latest for beat subdivisions N and + (the first and third subdivisions, respectively), and the interquartile ranges were also generally larger for those beat subdivisions. We also examined changes in vocal microtiming over time across the four beat subdivisions. For beat subdivisions N and +, we found significant negative relationships with release year: vocal onsets at beat subdivision N (β = −0.000090, t = −9.69, p < 0.001) and beat subdivision + (β = −0.000086, t = −9.08, p < 0.001) occurred increasingly earlier over time.
A different pattern emerged for beat subdivision A (the fourth subdivision), which showed a significant positive relationship with year (β = 0.000243, t = 16.46, p < 0.001), indicating that vocal onsets on beat subdivision A have shifted later in more recent music. Beat subdivision E did not exhibit a significant effect (β = −0.000029, t = −1.73, p = 0.084), though the direction of the trend was negative, as it was for subdivisions N and +. Together, these findings suggest that vocal timing across the beat grid has evolved nonuniformly: beat subdivisions N and + are now sung earlier, while beat subdivision A tends to be delayed. Figure 8 illustrates these changes, which may be driven by different genres becoming more prevalent in the dataset in later years.
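The slope and t values reported above are the usual outputs of a simple linear regression of microtiming on release year. The sketch below shows one way such values can be computed with ordinary least squares; the function name `slope_and_t` and the toy data are hypothetical, and the actual models in the study may have been specified differently.

```python
import math
from statistics import fmean


def slope_and_t(x, y):
    """Ordinary least squares slope of y on x, with its t statistic."""
    n = len(x)
    mean_x, mean_y = fmean(x), fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Residual variance with n - 2 degrees of freedom (slope and intercept).
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    standard_error = math.sqrt(rss / (n - 2) / sxx)
    return slope, slope / standard_error


# Toy data (hypothetical): one microtiming value per release year.
years = [1965, 1975, 1985, 1995, 2005]
microtiming = [0.070, 0.068, 0.069, 0.066, 0.065]
slope, t_value = slope_and_t(years, microtiming)

# Size of a reported slope when extrapolated linearly over the 45-year span.
shift_over_corpus = -0.000090 * 45
```

Read this way, a slope of −0.000090 per year amounts to a shift of about −0.004 of a beat over 45 years, which is small relative to the 64th-note scale of 0.0625.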

Figure 8
Microtiming of each beat subdivision as a function of year globally. Each dot represents a song. The red line represents the predicted slope with 95% confidence intervals. The green diamond and ribbon represent the mean per year and the standard error.
In Supplementary Figure S4, we include plots for the linear regressions between beat subdivision and year for each genre, analyzed independently. For beat subdivision N, there was a significant main effect of year in six genres: Country, Pop, Punk, Rap, RnB, and Rock. Country, Punk, and Rock showed a negative relationship with year, indicating that vocal onsets have shifted earlier over time and therefore closer to the isochronous metrical grid. For subdivision E, there was a significant main effect of year in six genres: Country, Electronic, Punk, Rap, RnB, and Rock. Country and Punk exhibited positive trends, indicating slightly later vocal onsets over time, while the other four genres exhibited negative trends, indicating slightly earlier vocal onsets over time.
For subdivision +, Pop, Rap, and RnB showed positive trends, while Country, Punk, and Rock showed negative trends, indicating slightly earlier vocal onsets over time and approaching the isochronous grid. For subdivision A, there was a significant main effect of year in nine genres—all but Metal. Seven genres showed positive trends (Country, Latin, Pop, Punk, Reggae, RnB, and Rock), indicating that vocal onsets have shifted later over time. For all four subdivisions, the trend direction observed in Rock, the largest genre in the dataset, matched the global trend.
5.3 Stronger and weaker beat subdivisions
To assess whether systematic timing deviations varied by beat subdivision, we modeled the deviation from isochronous grid locations (0, 0.25, 0.5, 0.75) as a function of subdivision position using a linear model. Each subdivision was treated as a categorical predictor (factor), and the dependent variable was the deviation of each vocal onset from its corresponding isochronous metrical target (i.e., 0, 0.25, 0.5, 0.75).
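The deviation measure described above can be sketched as follows. This is an illustration under stated assumptions, not the study's implementation: it assumes onset positions are given as a phase within the beat in the range [0, 1), that each onset is assigned to its nearest grid target with wrap-around at the beat boundary, and the names `signed_deviation` and `mean_deviation_by_subdivision` are hypothetical. For a linear model with a single categorical predictor, the fitted value for each level equals that level's mean, which is what the second function returns.

```python
from collections import defaultdict
from statistics import fmean

# Isochronous grid targets for the four 16th-note subdivisions of a beat.
GRID = {"N": 0.0, "E": 0.25, "+": 0.5, "A": 0.75}


def signed_deviation(phase):
    """Return (label, deviation) for the nearest grid target.

    Negative deviations are early, positive deviations are late. The
    difference is wrapped into [-0.5, 0.5) so that a phase of 0.97 counts
    as 0.03 early on the next N rather than 0.97 late.
    """
    best_label, best_diff = None, None
    for label, target in GRID.items():
        diff = (phase - target + 0.5) % 1.0 - 0.5
        if best_diff is None or abs(diff) < abs(best_diff):
            best_label, best_diff = label, diff
    return best_label, best_diff


def mean_deviation_by_subdivision(phases):
    """Group signed deviations by subdivision and average each group."""
    by_label = defaultdict(list)
    for phase in phases:
        label, deviation = signed_deviation(phase)
        by_label[label].append(deviation)
    return {label: fmean(devs) for label, devs in by_label.items()}


# Toy onset phases (hypothetical values, not from the dataset).
toy_phases = [0.02, 0.97, 0.27, 0.52, 0.56, 0.74]
toy_means = mean_deviation_by_subdivision(toy_phases)
```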
For subdivisions of a bar, the results revealed a significant main effect of subdivision (, ), indicating that deviations were not uniform across the measure. Estimated marginal means showed larger deviations on beats 1 and 3 (means = and , respectively) compared to beats 2 and 4 (means = and ). Pairwise comparisons confirmed that all subdivisions differed significantly from each other after Tukey correction () (see Figure 9).

Figure 9
Microtiming of each beat subdivision pair as a function of year globally. Each dot represents a song. The red line represents the predicted slope with 95% confidence intervals. The green diamond and ribbon represent the mean per year and the standard error.
Within a beat, the linear model revealed a significant main effect of beat (, ), indicating nonuniform timing behavior across the beat. Estimated marginal means showed that beat subdivisions N and + had larger deviations (means = and ), whereas subdivisions E and A exhibited smaller deviations (means = and ). Post‑hoc Tukey tests confirmed that most beat subdivisions differed significantly from one another (), with the exception of beat subdivisions E and A, which did not differ significantly (). This pattern aligns with the hypothesis that vocalists exhibit more precise timing on weaker beats (2 and 4), reinforcing the presence of a stronger–weaker beat structure.
Figure 9 illustrates microtiming of each beat subdivision pair, i.e., the stronger subdivisions N and + and the weaker subdivisions E and A. The stronger beat subdivisions generally arrive late.
6 Discussion
Our analysis revealed that vocal microtiming varies by genre, changes over time, and reflects distinct treatment of beats and beat subdivisions. Songs from the Rap genre tended to have the least timing deviation in all beat subdivisions, and the difference between Rap and other genres was clearest in beats 1 and 3 (see Figure 6). Rap songs in the dataset exhibit a significantly higher average vocal onset–to‑beat ratio than other genres (see Table 1). This is likely due to less sustained pitch in performance, as Rap music tends to be more speech‑like (Kautny, 2015). This finding is also visible in Figure 6, where the interquartile range for Rap is smaller than those for the other genres. We observe more onsets falling closer to the isochronous metrical grid, especially for the four beats. However, this does not necessarily reflect more on‑the‑grid timing. The increased density of onsets may lead to timing deviations that are distributed symmetrically around the grid, effectively canceling each other out. This pattern would create the appearance of grid alignment even though the underlying microtiming may still deviate from the grid, a distinction that carat cannot capture. This echoes previous research, where Rap vocals were found to have a consistently higher total pitch variation than other musical genres (Georgieva et al., 2024; Walls et al., 2023).
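The cancellation effect described above is easy to reproduce with toy numbers. The sketch below compares a mean of signed deviations with a mean of absolute deviations for two hypothetical performances; the deviation values and variable names are invented for illustration and say nothing about how carat computes its output internally.

```python
from statistics import fmean


def mean_signed_deviation(deviations):
    """Average of signed deviations: early and late onsets can cancel out."""
    return fmean(deviations)


def mean_absolute_deviation(deviations):
    """Average distance from the grid, regardless of direction."""
    return fmean([abs(d) for d in deviations])


# Two hypothetical performances, both symmetric around the grid.
tight = [0.01, -0.01, 0.01, -0.01]
loose = [0.06, -0.06, 0.06, -0.06]

tight_signed = mean_signed_deviation(tight)
loose_signed = mean_signed_deviation(loose)
tight_absolute = mean_absolute_deviation(tight)
loose_absolute = mean_absolute_deviation(loose)
```

Both performances have a signed mean of zero, yet the second one is six times further from the grid on average, which is the kind of difference a measure based only on mean deviation would miss.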
Table 1
Table depicting the number of songs in each genre and the average of the per‑song vocal onset–to‑beat ratios, with standard deviations.
| Genre | Songs | Onset Density (Avg.) | Onset Density (Std.) |
|---|---|---|---|
| Rap | 5,373 | 2.06 | 1.11 |
| Reggae | 3,058 | 1.09 | 0.69 |
| RnB | 5,533 | 1.04 | 0.44 |
| Latin | 2,136 | 0.95 | 0.43 |
| Pop | 12,475 | 0.83 | 0.36 |
| Country | 3,096 | 0.77 | 0.31 |
| Elec. | 5,878 | 0.70 | 0.60 |
| Rock | 40,079 | 0.67 | 0.37 |
| Metal | 3,547 | 0.61 | 0.30 |
| Punk | 2,555 | 0.59 | 0.24 |
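
The onset density summarized in Table 1 is described as an average of per-song vocal onset-to-beat ratios. A minimal sketch of that computation is given below, assuming each song comes with a count of detected vocal onsets and a count of beats in the analyzed excerpt. The record layout, the function name `density_by_genre`, the toy counts, and the use of the sample standard deviation are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import fmean, stdev


def density_by_genre(songs):
    """Return {genre: (mean, std)} of per-song onset-to-beat ratios.

    `songs` is an iterable of (genre, n_vocal_onsets, n_beats) tuples.
    The sample standard deviation is used here; whether the table uses the
    sample or population form is not stated, so this is an assumption.
    """
    ratios = defaultdict(list)
    for genre, n_vocal_onsets, n_beats in songs:
        ratios[genre].append(n_vocal_onsets / n_beats)
    return {
        genre: (fmean(values), stdev(values))
        for genre, values in ratios.items()
    }


# Toy records (hypothetical counts, not from the dataset).
toy_songs = [
    ("Rap", 128, 64),
    ("Rap", 96, 64),
    ("Rock", 32, 64),
    ("Rock", 48, 64),
]
toy_density = density_by_genre(toy_songs)
```

A ratio above 1 means more than one vocal onset per beat on average, as in the Rap row of the table, while a ratio below 1 means that many beats carry no vocal onset.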
Across the dataset, we observed that performers were more expressive with beats 1 and 3 than with beats 2 and 4, consistently anticipating the isochronous grid location. Over time, we observe a subtle but consistent shift in beat vocal entries toward later onset locations (closer to the isochronous metrical target), especially on beats 1, 2, and 3. This shift could be due to increasingly widespread use of digital recording tools, including click tracks and digital editing tools (Brøvig and Danielsen, 2016). The shift could also reflect evolving expressive norms and listener preferences for on‑the‑grid performances (Frühauf et al., 2013). The shift of beat vocal onsets toward later onset locations could also be due to more recent songs having more vocal onsets. Indeed, songs from more recent years in the dataset do generally show a significantly higher average vocal onset–to‑beat ratio than older songs, which may at least partially explain the trend (see Table 2).
Table 2
Table depicting the number of songs in each demi‑decade and the average of the per‑song vocal onset–to‑beat ratios, with standard deviations.
| Demi‑Decade | Songs | Onset Density (Avg.) | Onset Density (Std.) |
|---|---|---|---|
| 1965–1969 | 1,072 | 0.69 | 0.25 |
| 1970–1974 | 1,701 | 0.78 | 0.31 |
| 1975–1979 | 2,491 | 0.72 | 0.28 |
| 1980–1984 | 3,721 | 0.69 | 0.32 |
| 1985–1989 | 5,103 | 0.72 | 0.39 |
| 1990–1994 | 7,922 | 0.81 | 0.50 |
| 1995–1999 | 12,303 | 0.86 | 0.55 |
| 2000–2004 | 21,158 | 0.87 | 0.55 |
| 2005–2010 | 32,887 | 0.84 | 0.71 |
Per‑genre linear models of onset density over time are illustrated in Supplementary Figure S5. Seven genres (Country, Electronic, Latin, Pop, RnB, Reggae, and Metal) show statistically significant increases in onset density over time, while Punk, Rap, and Rock show no significant change. These results indicate that the overall increase in onset density is driven by a combination of shifts in genre prevalence and within‑genre trends.
Work by Danielsen and colleagues shows that perceived timing is shaped not only by temporal placement but also by attack shape, timbre, duration, dynamics, and articulation, such that apparent timing deviations may arise from these nontemporal features (Danielsen et al., 2023, 2024). This aligns with vocal‑specific findings from Devaney et al., who note that differences in consonant–vowel articulation and attack can lead to later or less sharply defined onsets (Devaney et al., 2011).
For beat subdivisions, our most extreme result was RnB onsets in beat subdivision N, on average arriving a little over a 32nd note late. Metal and Latin music showed onsets arriving around a 32nd note late. These genres may encourage more expressive timing. As with beats, we see a shift in at least three of the four beat subdivisions toward the isochronous metrical target, again likely due to digital editing tools, evolving norms, an increase in the number of onsets per song, or increasing musical tempo.
Stronger metrical positions (subdivisions N and +) were more likely to be delayed compared to weaker ones (E and A). This suggests artists take more artistic liberties in these ‘stronger’ subdivisions and treat them differently from the ‘weaker’ subdivisions. It is possible that other instruments are more strongly in‑time on these beat subdivisions, giving the singer more expressive freedom. Together, our results suggest that rhythmic expression is conditioned by both musical timing context and genre‑specific conventions.
7 Conclusion
In this exploratory research, we examined vocal microtiming trends in 88,357 songs spanning 45 years. Our findings reveal that vocal microtiming in popular music differs significantly by genre, has changed over musical eras, and exhibits distinct patterns across beats and beat subdivisions.
The carat tool captures only mean timing deviation, making it insensitive to symmetrical timing fluctuations that average out on the grid. Future work can explore new tools to measure absolute microtiming deviation and better represent vocal lines in Western popular music. Future work could also extend this approach to transition probabilities, examining how the likelihood of a note being ‘on time’ at each metrical position depends on the surrounding notes.
We have demonstrated the utility of the methods presented here for studying vocals and believe they have the potential to be applied to the study of other musical instruments and styles. While our findings reflect the Western focus of the MSD, they highlight microtiming as a meaningful marker of style and expression. Future work could expand this analysis to more diverse vocal practices.
Authorship Contributions
Elena Georgieva: Writing – Original Draft, Conceptualization, Data Curation, Formal Analysis, Funding Acquisition, Investigation, Methodology, Resources, Software, Validation, Visualization
Daniel Fernandes: Writing – Review & Editing, Methodology, Software
Ethan Cajetan Menezes: Writing – Review & Editing, Methodology, Software
Valerian Coelho: Writing – Review & Editing, Methodology, Software
Magdalena Fuentes: Writing – Review & Editing, Conceptualization, Formal Analysis, Investigation, Methodology, Resources, Software, Supervision, Visualization
Pablo Ripollés: Writing – Review & Editing, Conceptualization, Data Curation, Formal Analysis, Funding Acquisition, Investigation, Methodology, Resources, Software, Supervision, Visualization
Brian McFee: Writing – Review & Editing, Conceptualization, Data Curation, Formal Analysis, Funding Acquisition, Investigation, Methodology, Resources, Software, Supervision, Visualization
Data Accessibility
We provide a complete list of MSD track IDs used in this work, along with all extracted microtiming characteristics on Zenodo: https://zenodo.org/records/18500030.
The vocal onsets we annotated to evaluate onset detection performance are available on GitHub: https://github.com/elenatheodora/MSD_Annotated_Onsets
Funding Information
This work was funded by the NYU Steinhardt Doctoral Fellowship, the SpokenWeb Project, and The NYU Digital Humanities Summer Fellowship.
Competing Interests
The authors have no competing interests to declare.
Additional File
The additional file for this article can be found as follows:
Supplementary Appendix A
Supplementary Microtiming Tables and Plots. DOI: https://doi.org/10.5334/tismir.278.s1.
