Pop Music Highlighter: Marking the Emotion Keypoints
Open Access | Sep 2018

Figures & Tables

Figure 1

Architecture of two attention-based models using different fusion methods for highlight extraction. We note that model (a) was used by Huang et al. (2017b) and model (b) by Ha et al. (2017).
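For a concrete picture of the two fusion strategies, the following NumPy sketch contrasts early fusion (attention-pooling the chunk features before a single song-level prediction) with late fusion (predicting per chunk and attention-pooling the predictions). The function names and the mapping to the EF/LF labels are illustrative; they follow the NAM-EF/NAM-LF naming used in Table 2 rather than any code released by the authors.

```python
import numpy as np

def softmax(x):
    """Softmax-normalize a vector of attention scores."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def early_fusion(chunk_features, scores, classifier):
    """EF: attention-pool the chunk features first, then classify the pooled vector."""
    alpha = softmax(scores)                                    # (T,) attention over chunks
    pooled = (alpha[:, None] * chunk_features).sum(axis=0)     # (D,) song-level feature
    return classifier(pooled)                                  # one song-level prediction

def late_fusion(chunk_features, scores, classifier):
    """LF: classify every chunk, then attention-pool the chunk-level predictions."""
    alpha = softmax(scores)
    chunk_preds = np.stack([classifier(h) for h in chunk_features])  # (T, C)
    return (alpha[:, None] * chunk_preds).sum(axis=0)                # (C,) song-level prediction
```

Both functions take the same chunk features, unnormalized attention scores, and chunk-level classifier; the only difference is whether the attention weights are applied before or after classification.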

Table 1

Network architecture of the proposed NAM-LF (pos) model. For convolutional layers (conv), the values represent (from left to right in the same row): the number of filters, kernel size, strides, and activation function. For fully-connected layers, the values represent: the number of hidden units, dropout rate, and activation function. All layers use batch normalization. We also show the sizes of the input, the output, and all intermediate outputs at the training stage.

(Assume a mini-batch size of 16; each clip has 8 chunks.)

Input: 16 × 8 × 129 × 128
reshape to {Xt}: 128 × 129 × 128

Feature extraction
  conv | 64 | 3 × 128 | (2, 128) | ReLU
  conv | 128 | 4 × 1 | (2, 1) | ReLU
  conv | 256 | 4 × 1 | (2, 1) | ReLU
  global max-pool: 128 × 256
  reshape to {ht}: 16 × 8 × 256

Attention mechanism
  add positional encodings: 16 × 8 × 256
  fully-connected | 256 | 0.5 | ReLU
  fully-connected | 256 | 0.5 | ReLU
  fully-connected | 256 | 0.5 | tanh
  fully-connected | 1 | 0.5 | linear
  softmax along the second (chunk) axis: 16 × 8 × 1
  reshape to {αt}: 16 × 8

Chunk-level prediction
  fully-connected | 1024 | 0.5 | ReLU
  fully-connected | 190 | 0.5 | softmax
  {ŷt}: 16 × 8 × 190

Song-level prediction
  ŷ = Σ_t α_t ŷ_t
  Output {ŷ}: 16 × 190
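As a reading aid for Table 1, here is a minimal PyTorch sketch of the same layer stack. It follows the table's shapes (8 chunks of 129 × 128 input per clip, 190 output classes), but it makes assumptions the table does not pin down: the positional encodings are modelled as learnable parameters, batch normalization is only shown on the convolutional layers, and the exact dropout placement is illustrative.

```python
import torch
import torch.nn as nn

class NAMLFPos(nn.Module):
    """Sketch of the NAM-LF (pos) layer stack in Table 1: chunk CNN, attention, late fusion."""

    def __init__(self, n_classes=190, n_chunks=8, d=256):
        super().__init__()
        # Feature extraction over one chunk: 1 x 129 (frames) x 128 (mel bins)
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, (3, 128), stride=(2, 128)), nn.BatchNorm2d(64), nn.ReLU(),
            nn.Conv2d(64, 128, (4, 1), stride=(2, 1)), nn.BatchNorm2d(128), nn.ReLU(),
            nn.Conv2d(128, 256, (4, 1), stride=(2, 1)), nn.BatchNorm2d(256), nn.ReLU(),
        )
        # Positional encodings, one vector per chunk position (learnable here; an assumption)
        self.pos = nn.Parameter(torch.zeros(n_chunks, d))
        # Attention scorer: 256 -> 256 -> 256 -> 1, softmaxed over the chunk axis
        self.att = nn.Sequential(
            nn.Linear(d, 256), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(256, 256), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(256, 256), nn.Tanh(), nn.Dropout(0.5),
            nn.Linear(256, 1),
        )
        # Chunk-level classifier: 256 -> 1024 -> 190
        self.clf = nn.Sequential(
            nn.Linear(d, 1024), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(1024, n_classes),
        )

    def forward(self, x):                                    # x: (B, T, 129, 128)
        B, T, n_frames, n_mels = x.shape
        h = self.cnn(x.reshape(B * T, 1, n_frames, n_mels))  # (B*T, 256, F', 1)
        h = h.flatten(2).max(dim=2).values                   # global max-pool -> (B*T, 256)
        h = h.reshape(B, T, -1) + self.pos                   # add positional encodings
        alpha = torch.softmax(self.att(h).squeeze(-1), dim=1)    # (B, T) attention weights
        y_chunk = torch.softmax(self.clf(h), dim=-1)              # (B, T, 190) chunk predictions
        y_song = (alpha.unsqueeze(-1) * y_chunk).sum(dim=1)       # late fusion: ŷ = Σ_t α_t ŷ_t
        return y_song, alpha
```

With a batch of 16 clips, the input tensor has shape (16, 8, 129, 128) and the song-level output has shape (16, 190), matching the first and last rows of the table.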
Table 2

Performance of different music highlight extraction methods for chorus detection.

Group        | Method            | F-measure | Recall | Precision
             | Upper bound       | 0.9493    | 0.9997 | 0.9173
Unsupervised | Middle            | 0.3558    | 0.4708 | 0.2943
Unsupervised | Spectral energy   | 0.7562    | 0.8608 | 0.6960
Unsupervised | Spectral centroid | 0.5385    | 0.6285 | 0.4867
Unsupervised | Spectral roll-off | 0.5080    | 0.6059 | 0.4563
Unsupervised | Repetition        | 0.4795    | 0.5973 | 0.4110
Emotion      | RNAM-LF           | 0.7803    | 0.9006 | 0.7097
Emotion      | NAM-LF (pos)      | 0.7994    | 0.9017 | 0.7397
Emotion      | NAM-EF (pos)      | 0.7686    | 0.8727 | 0.7073
Emotion      | NAM-LF            | 0.7739    | 0.8760 | 0.7120
Genre        | RNAM-LF           | 0.6314    | 0.7488 | 0.5663
Genre        | NAM-LF (pos)      | 0.5891    | 0.6993 | 0.5273
Genre        | NAM-EF (pos)      | 0.4688    | 0.5649 | 0.4167
Genre        | NAM-LF            | 0.5685    | 0.6725 | 0.5127
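Table 2 reports one precision/recall/F-measure triple per method for a single extracted highlight per song. One plausible way to score a 30-second highlight against the annotated chorus sections is sketched below: precision as the fraction of the highlight lying inside its best-overlapping chorus, and recall as the fraction of that chorus covered by the highlight. This formulation and the helper name `highlight_scores` are assumptions for illustration, not necessarily the exact protocol behind the table.

```python
def highlight_scores(highlight, choruses):
    """Second-level precision/recall/F-measure of a highlight (start, end), in seconds,
    against annotated chorus sections [(start, end), ...]; assumes at least one chorus."""
    h_start, h_end = highlight
    h_len = h_end - h_start
    # Best-matching chorus: the one with the largest temporal overlap with the highlight
    overlaps = [(max(0.0, min(h_end, c_end) - max(h_start, c_start)), c_end - c_start)
                for c_start, c_end in choruses]
    overlap, c_len = max(overlaps)
    precision = overlap / h_len           # fraction of the highlight inside the chorus
    recall = overlap / c_len              # fraction of that chorus covered by the highlight
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return f, recall, precision
```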
Figure 2

Top row: the ground-truth chorus sections, where different colors indicate different chorus sections (e.g., chorus A and chorus B) of a song. Second row: the energy curve. Last four rows: the attention curves estimated by four different emotion-based models, for three songs in RWC-Pop. From left to right: ‘Disc1/006.mp3’, ‘Disc2/003.mp3’ and ‘Disc3/008.mp3’. In RNAM-LF, we have an attention score for each 1-second audio chunk, following our previous work (Huang et al., 2017b); for the other three attention-based methods, we have an attention score for each 3-second audio chunk. The red regions mark the resulting 30-second highlights. More examples can be found on the GitHub page.
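The red highlight regions in Figures 2 and 3 can be read as the 30-second window with the most attention mass. Below is a minimal sketch of that post-processing, assuming one attention score per 3-second chunk and a simple sliding-window sum; the authors' exact selection rule may differ.

```python
import numpy as np

def extract_highlight(attention, chunk_sec=3.0, highlight_sec=30.0):
    """Return (start, end) in seconds of the consecutive-chunk window
    whose attention scores sum to the maximum."""
    win = int(round(highlight_sec / chunk_sec))                 # e.g. 10 chunks of 3 s
    if len(attention) <= win:
        return 0.0, len(attention) * chunk_sec                  # song shorter than the window
    sums = np.convolve(attention, np.ones(win), mode="valid")   # sliding-window sums
    start = int(np.argmax(sums))
    return start * chunk_sec, (start + win) * chunk_sec
```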

Figure 3

Last four rows: Attention curves and the resulting 30-second highlights of different attention-based methods, all genre based, for the same three songs used in Figure 2 (see Figure 2 caption for details).

Figure 4

Results of chorus detection by fusing the energy curve with the attention curve estimated by either (a) emotion-based NAM-LF (pos) or (b) genre-based NAM-LF (pos).
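One simple way to realize the fusion behind Figure 4 is to min-max-normalize the two curves and take a weighted average before running the same highlight selection; the equal-weight default and the function name `fuse_curves` are assumptions, since the exact fusion rule is not restated in this caption.

```python
import numpy as np

def fuse_curves(energy, attention, weight=0.5):
    """Weighted average of the min-max-normalized energy and attention curves.
    Both curves are assumed to be sampled on the same time grid."""
    def norm(c):
        c = np.asarray(c, dtype=float)
        span = c.max() - c.min()
        return (c - c.min()) / span if span > 0 else np.zeros_like(c)
    return weight * norm(energy) + (1.0 - weight) * norm(attention)
```

The fused curve can then be fed to the same sliding-window selection sketched after Figure 2.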

DOI: https://doi.org/10.5334/tismir.14 | Journal eISSN: 2514-3298
Language: English
Submitted on: Mar 3, 2018
Accepted on: Jun 3, 2018
Published on: Sep 4, 2018
Published by: Ubiquity Press

© 2018 Yu-Siang Huang, Szu-Yu Chou, Yi-Hsuan Yang, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.