
Figure 1
Fixed-Grid representation of a 1-measure pattern for 12 drums in a web interface designed by Yuri Suzuki (808303.studio) and inspired by Roland’s TR-808 Rhythm Composer.

Figure 2
Fixed-Grid representation of a 1-measure pattern for a single drum in the interface for Propellerhead’s ReDrum drum machine.

Figure 3
(a) One measure of drums from the Groove MIDI Dataset visualized in pianoroll format. In a grid at 16th-note resolution, 9 of the 15 snare drum hits in this measure would be mapped to duplicate slots in a matrix; of these, only 3 notes (colored in yellow) could be kept, and the other 6 (colored in red) would need to be discarded or quantized. (b) Mapping drum onset events to slots in our proposed Flexible Grid data representation. Red notes are considered secondary. Each instrument channel (kick, snare, hi-hat, etc.) receives one primary event per 16th note timestep, and space for secondary events is distributed with the minimum number of slots needed to fit the densest passages in the training set. Every event here has two continuous modification parameters for velocity and timing offsets.
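The slot-assignment step described in this caption can be sketched in a few lines. This is an illustrative reconstruction, not the paper's actual code: the function name, the tuple layout, and the `secondary_slots` argument (per-instrument overflow capacity, e.g. 6 extra snare slots) are all assumptions.

```python
from collections import defaultdict

def flexible_grid_encode(onsets, secondary_slots=None):
    """Sketch: map drum onsets to primary/secondary slots per 16th-note step.

    `onsets` is a list of (instrument, step, velocity, timing_offset) tuples,
    where `step` is the nearest 16th-note index and `timing_offset` is the
    signed deviation from that gridline. Names are illustrative only.
    """
    secondary_slots = secondary_slots or {}
    grid = defaultdict(list)  # (instrument, step) -> all events at that step
    for inst, step, vel, off in onsets:
        grid[(inst, step)].append((vel, off))

    primary, secondary, dropped = {}, defaultdict(list), []
    for (inst, step), events in grid.items():
        # Assumed tie-break: the event closest to the gridline is primary.
        events.sort(key=lambda e: abs(e[1]))
        primary[(inst, step)] = events[0]
        cap = secondary_slots.get(inst, 0)   # overflow capacity per instrument
        secondary[(inst, step)] = events[1:1 + cap]
        dropped.extend(events[1 + cap:])     # empty for Flexible Grid (16)
    return primary, secondary, dropped
```

With capacities sized to the densest passages in the training set (Table 1), `dropped` stays empty, which is the point of the representation.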
Table 1
Statistics of the Groove MIDI Dataset used to build a Flexible Grid Representation at 16th note resolution.
| Drum | Max # of Onsets within One 16th Note |
|---|---|
| Kick | 3 |
| Snare | 7 |
| Closed Hi-hat | 4 |
| Open Hi-hat | 3 |
| Low Tom | 3 |
| Mid Tom | 3 |
| Hi Tom | 3 |
| Crash Cymbal | 2 |
| Ride Cymbal | 2 |
| Total | 30 |
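The per-instrument maxima in Table 1 amount to counting the densest 16th-note step for each drum across the dataset. A minimal sketch, assuming onsets have already been snapped to their nearest 16th-note index (function and field names are hypothetical):

```python
from collections import Counter

def max_onsets_per_step(onsets):
    """Sketch of Table 1's statistic: the densest 16th-note step per drum.

    `onsets` is an iterable of (instrument, step) pairs, with `step` the
    nearest 16th-note index. Illustrative names, not the paper's API.
    """
    counts = Counter(onsets)  # (instrument, step) -> onset count
    maxima = {}
    for (inst, _step), n in counts.items():
        maxima[inst] = max(maxima.get(inst, 0), n)
    return maxima
```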
Table 2
Counts and percentages of events in the Groove MIDI Dataset training data that would be quantized or dropped by each data representation, before any modeling takes place. Variable-length sequences in the Event-Based representation are between 4 and 300 tokens long.
| Representation | # Skipped | % Skipped | Size |
|---|---|---|---|
| Fixed-Grid (16) | 24038 | 6.94% | 32 × 9 × 3 |
| Fixed-Grid (32) | 9875 | 2.85% | 64 × 9 × 3 |
| Fixed-Grid (64) | 3210 | 0.92% | 128 × 9 × 3 |
| Event-Based | 348 | 0.10% | X × 168 |
| Flexible Grid (16) | 0 | 0 | 32 × 30 × 3 |
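Table 2's "# Skipped" column counts events that collide with an earlier onset in the same (instrument, gridline) cell of a fixed grid and so cannot be stored. A hedged sketch of that count (the rounding convention and argument names are assumptions):

```python
from collections import Counter

def fixed_grid_skipped(onsets, steps):
    """Sketch of Table 2's '# Skipped': colliding events in a fixed grid.

    `onsets` holds (instrument, time_in_measures) pairs; `steps` is the
    number of gridlines per measure (16, 32, or 64). Each cell keeps one
    event; the rest are counted as skipped. Illustrative only.
    """
    cells = Counter((inst, round(t * steps)) for inst, t in onsets)
    skipped = sum(n - 1 for n in cells.values())
    return skipped, 100.0 * skipped / max(len(onsets), 1)
```

Doubling the resolution shrinks the collision count, as the 16/32/64 rows show, but never guarantees zero skips the way per-instrument secondary slots do.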

Figure 4
Results of a blind head-to-head listening survey. Eleven drummers each completed 15 trials, in each trial choosing between a pair of two-measure drum loops generated by VAEs trained on each of three data representations.

Figure 5
VAE reconstruction quality (per-onset F1 scores), plotted for subsets of sequences containing progressively more drum rolls and fast gestures. Data are aggregated cumulatively: the leftmost point on each line includes all drum sequences, the next point includes all sequences with at least one event captured in the secondary matrix S, and so on.
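One common way to score per-onset reconstruction, shown here only as a plausible sketch (the paper's exact matching criterion is not specified in this caption, and the tolerance value is an assumption): greedily match predicted onsets to reference onsets within a small time window, then compute F1.

```python
def onset_f1(ref, pred, tol=0.02):
    """Sketch: per-onset F1 for one instrument, greedily matching predicted
    onset times (seconds) to reference times within `tol`. Assumed metric,
    not necessarily the paper's exact procedure.
    """
    ref, pred = sorted(ref), sorted(pred)
    matched, used = 0, [False] * len(pred)
    for r in ref:
        for j, p in enumerate(pred):
            if not used[j] and abs(p - r) <= tol:
                used[j] = True
                matched += 1
                break
    if matched == 0:
        return 0.0
    precision, recall = matched / len(pred), matched / len(ref)
    return 2 * precision * recall / (precision + recall)
```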
Table 3
Accuracy scores for classifying drummer identity and genre with an MLP neural network, with 95% bootstrap confidence intervals. The Event-Based representation is excluded because its variable-length sequences cannot be fed directly to a feed-forward classification model.
| Representation | Drummer ID | Genre ID |
|---|---|---|
| Fixed-Grid (16) | 0.634 ±0.027 | 0.547 ±0.026 |
| Fixed-Grid (32) | 0.650 ±0.026 | 0.544 ±0.026 |
| Fixed-Grid (64) | 0.615 ±0.026 | 0.519 ±0.026 |
| Event-Based | N/A | N/A |
| Flexible Grid (16) | 0.683 ±0.024 | 0.540 ±0.027 |
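The ± intervals in Table 3 are 95% bootstrap confidence intervals over per-example outcomes. A minimal sketch of the standard percentile-bootstrap procedure (resampling details in the paper may differ):

```python
import random

def bootstrap_ci(correct, n_boot=10000, alpha=0.05, seed=0):
    """Sketch: percentile bootstrap CI for classification accuracy.

    `correct` is a list of 0/1 per-example outcomes. Resamples the outcomes
    with replacement `n_boot` times and takes the alpha/2 and 1 - alpha/2
    percentiles of the resampled accuracies. Illustrative parameters.
    """
    rng = random.Random(seed)
    n = len(correct)
    accs = sorted(
        sum(rng.choice(correct) for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = accs[int(alpha / 2 * n_boot)]
    hi = accs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```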
