
Figure 1
Illustration of Chinese bowed instruments. The XiQin (奚琴) image is taken from an ancient book by Chen (1101), and the other HuQin instruments are from Liu (1992).
Table 1
Summary of relevant musical performance datasets. All statistics are counted for bowed string instruments only, and the reported durations refer to solo excerpts. (*) The number of clips is not given in the dataset's documentation; the data consist of randomly generated samples from a sound bank.
| DATASET | #INSTRUMENTS | #PT CLIPS | EXCERPT DURATION | ANNOTATION | CONTENT |
|---|---|---|---|---|---|
| SOL | 4 | 12,000 clips, 15-class | N/A | N/A | audio |
| RWC | 4 | 236 clips, 5-class | 33.7min | N/A | audio |
| IPT-cello | 1 | 13.5h *, 18-class | N/A | PTs | audio |
| TU-NOTE | 1 | 1,005 clips, 4-class | 15.8min | note transitions, PTs | audio |
| CVD | 1 | 718 clips, 5-class | 37.3min | N/A | audio |
| HF1 | 1 | N/A | 42.6min | pitch, emotion | audio, transcription |
| URMP | 4 | N/A | 78min | pitch | audio, video, score, transcription |
| TELMI | 1 | N/A | N/A | N/A | audio, video, sensor data |
| CTIS | 8 | 1,072 clips, 11-class | 10.3min | N/A | audio |
| CCOM-HuQin | 8 | 11,992 clips, 12-class | 77min | pitch, PTs | audio, video, score, transcription |
Table 2
Playing techniques (PTs) in Chinese and Pinyin, the corresponding techniques in the violin family where applicable, and the abbreviations used in this paper. N/A means there is no corresponding technique in the violin family.
| CH | CH-PINYIN | IN VIOLIN FAMILY | ABBR. |
|---|---|---|---|
| Bowing techniques | |||
| 颤弓 | ChanGong | Tremolo | Tremolo |
| 垫弓 | DianGong | N/A | DianG |
| 顿弓 | DunGong | Martelé | DunG |
| 断弓 | DuanGong | Détaché | DuanG |
| 跳弓 | TiaoGong | Spiccato | TiaoG |
| 抛弓 | PaoGong | Ricochet | PaoG |
| 击弓 | JiGong | N/A | JiG |
| 大击弓 | DaJiGong | N/A | DaJiG |
| Fingering techniques | |||
| 揉弦 | RouXian | Vibrato | Vibrato |
| 滚揉 | GunRou | Rolling Vibrato | RVib |
| 压揉 | YaRou | Pressing Vibrato | PVib |
| 滑揉 | HuaRou | Sliding Vibrato | SVib |
| 滑音 | HuaYin | Portamento | Port |
| 上滑音 | ShangHuaYin | Upward Portamento | UPort |
| 下滑音 | XiaHuaYin | Downward Portamento | DPort |
| 上回滑音 | ShangHuiHuaYin | Up-Down Portamento | UDPort |
| 下回滑音 | XiaHuiHuaYin | Down-Up Portamento | DUPort |
| 垫指滑音 | DianZhiHuaYin | Intermediate Portamento | IPort |
| 颤音 | ChanYin | Trill | Trill |
| 打音 | DaYin | N/A | DaYin |
| 短颤音 | DuanChanYin | Short Trill | ShTrill |
| 长颤音 | ChangChanYin | Long Trill | LoTrill |
| 拨弦 | BoXian | Pizzicato | Pizz |

Figure 2
RMS envelopes, with amplitude on the y-axis, of the bowing techniques (a-g) and of Pizz (h), a special fingering technique; pitch trajectories, with F0 on the y-axis, of the other fingering techniques (i-p).
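The RMS envelopes in panels (a-h) summarize how loudness evolves within a clip. As a minimal sketch, assuming a mono NumPy signal and illustrative frame and hop sizes (2048 and 512 samples, which the paper does not specify), a frame-wise RMS envelope can be computed as follows:

```python
import numpy as np


def rms_envelope(signal, frame_length=2048, hop_length=512):
    """Return the frame-wise RMS amplitude of a mono signal.

    frame_length and hop_length (in samples) are illustrative defaults,
    not values taken from the paper.
    """
    x = np.asarray(signal, dtype=float)
    # Zero-pad short signals so that at least one full frame exists.
    if x.size < frame_length:
        x = np.pad(x, (0, frame_length - x.size))
    n_frames = 1 + (x.size - frame_length) // hop_length
    # Index matrix: row i holds the sample indices of analysis frame i.
    idx = hop_length * np.arange(n_frames)[:, None] + np.arange(frame_length)[None, :]
    frames = x[idx]
    # Root of the mean squared amplitude in each frame.
    return np.sqrt(np.mean(frames ** 2, axis=1))
```

For a steady sine of unit amplitude the envelope is flat at about 0.707 (that is, 1/√2), whereas bowing techniques such as DunG or TiaoG would show distinct attack and decay shapes. Libraries such as librosa provide an equivalent `librosa.feature.rms` with centering and padding options.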

Figure 3
(a) Floor plan of the recording studio. (b) Examples of the three camera views.

Figure 4
The annotation pipeline.

Figure 5
PT annotation examples of (a) Tremolo; (b) DianG; (c) PaoG; (d) Port; (e) Trill; (f) Vibrato.

Figure 6
Statistics of the short PT clips: (a) clip count per HuQin instrument; (b) clip count per PT; (c) pitch distribution per HuQin instrument (A4 = 440 Hz); (d) duration distribution of PTs.
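Panel (c) expresses pitch relative to the reference A4 = 440 Hz. The standard mapping from frequency to a (fractional) MIDI note number is m = 69 + 12·log2(f / 440). The sketch below applies it and rounds to the nearest equal-tempered note name; the helper names are hypothetical and the paper does not state how its pitch bins were computed.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def hz_to_midi(f0_hz, a4_hz=440.0):
    """Fractional MIDI number, with A4 (a4_hz) mapped to 69."""
    if f0_hz <= 0:
        raise ValueError("f0 must be positive")
    return 69.0 + 12.0 * math.log2(f0_hz / a4_hz)


def hz_to_note_name(f0_hz, a4_hz=440.0):
    """Nearest equal-tempered note name, e.g. 440 Hz -> 'A4'."""
    midi = round(hz_to_midi(f0_hz, a4_hz))
    octave = midi // 12 - 1  # MIDI 60 is C4
    return f"{NOTE_NAMES[midi % 12]}{octave}"
```

For example, the Erhu's open strings, conventionally tuned to D4 and A4, map to MIDI 62 and 69.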

Figure 7
(a) Number of notes in the excerpts for each HuQin instrument, with the percentage of annotated notes; (b) PT distribution over all annotations.

Figure 8
Pitch variation: ground-truth pitch tracks (red) compared with the score representation (black) for typical excerpts played on (a) Banhu, (b) Gaohu, (c) Erhu, and (d) Zhuihu.
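The gap between a pitch track and the score can be quantified in cents, 1200·log2(f0 / f_score). A minimal sketch, assuming the pitch track and the score have already been resampled to the same frame grid and that non-positive values mark unvoiced frames or rests (a common convention, assumed here rather than taken from the paper):

```python
import numpy as np


def cents_deviation(f0_hz, score_hz):
    """Frame-wise deviation (in cents) of a pitch track from the score pitch.

    Frames where either value is non-positive (assumed to mean unvoiced
    or rest) are returned as NaN.
    """
    f0 = np.asarray(f0_hz, dtype=float)
    ref = np.asarray(score_hz, dtype=float)
    dev = np.full(f0.shape, np.nan)
    voiced = (f0 > 0) & (ref > 0)
    dev[voiced] = 1200.0 * np.log2(f0[voiced] / ref[voiced])
    return dev


def mean_abs_deviation(f0_hz, score_hz):
    """Mean absolute deviation in cents over voiced frames only."""
    return float(np.nanmean(np.abs(cents_deviation(f0_hz, score_hz))))
```

A track that sits 50 cents sharp for half of the frames and exactly on pitch for the other half gives a mean absolute deviation of 25 cents; vibrato and portamento show up as oscillating or sloping deviations around the score line.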
Table 3
Statistics of the training and testing sets.
| DATASET | BOWED-STRING INSTRUMENTS | COUNT |
|---|---|---|
| CTIS-I | Erhu | 787 |
| CTIS-II | Banhu, Soprano Banhu, Alto Banhu, XiQin, Zhonghu, Zhuihu | 285 |
| Hybrid | Erhu | 252 |
| CCOM-HuQin | Erhu, Soprano Banhu, Alto Banhu, Tenor Banhu, Bass Banhu, Gaohu, Zhonghu, Zhuihu | 11,014 |
Table 4
F1 scores of the classification results.
| DATASET | CNN | CRNN |
|---|---|---|
| Homogeneous | ||
| CTIS-I | 97.54% | 99.19% |
| CCOM-HuQin | 96.07% | 97.85% |
| Heterogeneous (Train & Validation / Test) | ||
| CTIS-I/CTIS-II + Hybrid | 69.21% | 70.39% |
| CCOM-HuQin/CTIS-I & II + Hybrid | 77.01% | 87.01% |
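Table 4 reports F1 scores without stating how per-class scores are averaged. As a sketch under the assumption of macro averaging (an unweighted mean over classes), each class's F1 reduces to 2·TP / (2·TP + FP + FN):

```python
import numpy as np


def macro_f1(y_true, y_pred, labels=None):
    """Unweighted mean of per-class F1 scores.

    Macro averaging is an assumption; the paper does not state which
    averaging scheme it uses.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    if labels is None:
        labels = sorted(set(y_true.tolist()) | set(y_pred.tolist()))
    scores = []
    for c in labels:
        tp = np.sum((y_true == c) & (y_pred == c))
        fp = np.sum((y_true != c) & (y_pred == c))
        fn = np.sum((y_true == c) & (y_pred != c))
        denom = 2 * tp + fp + fn
        # F1 = 2PR / (P + R) = 2TP / (2TP + FP + FN); 0 if the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))
```

Macro averaging weights rare PTs as heavily as frequent ones, which matters given the skewed PT distribution; a support-weighted average would instead track overall accuracy more closely.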

Figure 9
(a) Nine-class confusion matrix of the CRNN classification results. (b) Spectrogram examples of two pairs of easily confused PTs.
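A row-normalized confusion matrix makes easily confused pairs stand out as large off-diagonal entries. The sketch below builds one from label lists and ranks the strongest confusions; the function names are hypothetical, and row normalization (so each diagonal entry is that class's recall) is an assumed presentation choice.

```python
import numpy as np


def normalized_confusion_matrix(y_true, y_pred, labels):
    """Confusion matrix with rows (true classes) normalized to sum to 1."""
    index = {label: i for i, label in enumerate(labels)}
    cm = np.zeros((len(labels), len(labels)), dtype=float)
    for t, p in zip(y_true, y_pred):
        cm[index[t], index[p]] += 1
    row_sums = cm.sum(axis=1, keepdims=True)
    # Classes with no samples keep an all-zero row instead of dividing by zero.
    return np.divide(cm, row_sums, out=np.zeros_like(cm), where=row_sums > 0)


def most_confused_pairs(cm, labels, k=2):
    """Return the k largest off-diagonal entries as (true, predicted, rate)."""
    off = cm.copy()
    np.fill_diagonal(off, 0.0)
    flat = np.argsort(off, axis=None)[::-1][:k]
    rows, cols = np.unravel_index(flat, off.shape)
    return [(labels[r], labels[c], float(off[r, c])) for r, c in zip(rows, cols)]
```

On toy predictions where DaYin is always predicted as Port and half of the Trill clips as Vibrato, these two pairs come out on top.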
Table 5
SVM classification accuracy on two pairs of easily confused PTs, for each coordinate axis.
| COORDINATES | DAYIN/PORT | TRILL/VIBRATO |
|---|---|---|
| X-axis | 87.24% | 77.82% |
| Y-axis | 86.40% | 72.30% |
| Z-axis | 86.31% | 76.33% |
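Table 5 reports one accuracy per coordinate axis. The paper's features, kernel and solver are not specified here, so the sketch below swaps in a linear SVM trained with Pegasos-style subgradient descent, written in plain NumPy to stay dependency-free. It assumes inputs of shape (n_clips, n_frames, 3), binary labels in {-1, +1} (for example Trill = +1, Vibrato = -1), and a hypothetical 80/20 train/test split; all of these are illustrative choices.

```python
import numpy as np


def train_linear_svm(X, y, lam=0.01, epochs=50, seed=0):
    """Pegasos-style subgradient descent for a linear SVM; y in {-1, +1}.

    For simplicity the bias is appended as a constant feature and is
    regularized together with the weights.
    """
    rng = np.random.default_rng(seed)
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    w = np.zeros(Xb.shape[1])
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            t += 1
            eta = 1.0 / (lam * t)  # decreasing step size
            if y[i] * (Xb[i] @ w) < 1:  # hinge loss is active
                w = (1 - eta * lam) * w + eta * y[i] * Xb[i]
            else:
                w = (1 - eta * lam) * w
    return w


def predict(w, X):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return np.where(Xb @ w >= 0, 1, -1)


def per_axis_accuracy(features, labels, train_frac=0.8, seed=0):
    """Train and test one classifier per axis on (n_clips, n_frames, 3) data."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(labels))
    n_train = int(train_frac * len(labels))
    tr, te = order[:n_train], order[n_train:]
    results = {}
    for axis, name in enumerate("xyz"):
        X = features[:, :, axis]
        # Standardize with training-set statistics only.
        mu, sd = X[tr].mean(axis=0), X[tr].std(axis=0) + 1e-8
        X = (X - mu) / sd
        w = train_linear_svm(X[tr], labels[tr], seed=seed)
        results[name] = float(np.mean(predict(w, X[te]) == labels[te]))
    return results
```

In practice a library solver such as scikit-learn's `SVC` (which also supports nonlinear kernels) would be the more usual choice; the hand-written version only makes the per-axis protocol explicit.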

Figure 10
Hand pose visualization of Trill and Vibrato: (a) key-point changes along the x-, y-, and z-axes; (b) selected video frames with fingertip labels.
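The per-axis view in panel (a) suggests a simple summary of a fingertip trajectory: the mean absolute frame-to-frame change on each axis, and an oscillation rate estimated from crossings of the mean-removed trajectory. This is an illustrative sketch, not the paper's feature pipeline; the (n_frames, 3) layout, the frame rate and the crossing-based rate estimator are assumptions.

```python
import numpy as np


def keypoint_motion(track):
    """Per-axis motion summary for one fingertip key point.

    track: array of shape (n_frames, 3) holding x, y, z per video frame.
    """
    track = np.asarray(track, dtype=float)
    deltas = np.diff(track, axis=0)  # frame-to-frame change on each axis
    centered = track - track.mean(axis=0)
    signs = np.sign(centered)
    # Count strict sign changes of the mean-removed trajectory per axis.
    crossings = np.sum(signs[1:] * signs[:-1] < 0, axis=0)
    return {
        "mean_abs_change": np.abs(deltas).mean(axis=0),
        "zero_crossings": crossings,
    }


def oscillation_rate_hz(n_crossings, n_frames, fps):
    """Two crossings per cycle, over a clip lasting n_frames / fps seconds."""
    return n_crossings / 2.0 / (n_frames / fps)
```

Removing only the mean assumes the hand does not drift during the clip; with noticeable drift, subtracting a moving average or linear trend first would give a more reliable crossing count.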

Figure 11
Comparison of the RMS envelopes of (a) PaoG on the Erhu and (b) ricochet on the violin.
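One way to quantify the contrast between these two envelopes is to locate the bounce-like peaks in each. The sketch below is a simple greedy peak picker with a hypothetical relative height threshold and minimum peak spacing (in frames); it is not the paper's analysis, and `scipy.signal.find_peaks` offers a more complete alternative with `height` and `distance` parameters.

```python
import numpy as np


def find_envelope_peaks(envelope, rel_threshold=0.3, min_distance=3):
    """Indices of prominent local maxima in an RMS envelope.

    rel_threshold (fraction of the global maximum) and min_distance
    (in frames) are illustrative defaults.
    """
    env = np.asarray(envelope, dtype=float)
    if env.size < 3 or env.max() <= 0:
        return np.array([], dtype=int)
    threshold = rel_threshold * env.max()
    mid = env[1:-1]
    # Interior local maxima that also clear the height threshold.
    candidates = np.where((mid > env[:-2]) & (mid >= env[2:]) & (mid >= threshold))[0] + 1
    peaks = []
    # Greedily keep the tallest peaks, skipping any too close to one already kept.
    for idx in candidates[np.argsort(env[candidates])[::-1]]:
        if all(abs(idx - p) >= min_distance for p in peaks):
            peaks.append(idx)
    return np.sort(np.array(peaks, dtype=int))
```

The number of returned peaks per bow stroke, together with their spacing and decay, gives a compact numerical description of how many times the bow bounces.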
