Multi-scale Approaches to the MediaEval 2015 “Emotion in Music” Task
Mingxing Xu, Xinxing Li, Haishu Xianyu, Jiashen Tian, Fanghang Meng, Wenxiao Chen
Human-Computer Speech Interaction Lab. (HCSIL), Department of Computer Science and Technology
Tsinghua University, Beijing, China
Motivation / Main Idea
1. High correlation within the music feature sequence
2. Multi-scale methods at three different levels
• Acoustic feature (run 3)
• Regression model (run 1, 2)
• Emotion annotation (run 4)
Feature Learning with Hierarchical Deep Neural Networks (DBNs + AE) Run 3
Acoustic features were organized into 4 groups according to their physical fundamentals and the time scales on which they were extracted.
NOTE: We submitted a paper to AAAI containing details about this framework.
[Figure labels: frame sizes 60 ms, 25 ms, 25 ms, 25 ms; win: 1 s; shift: 0.5 s; final features @ 2 Hz]
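Multi-scale BLSTM-RNNs based Fusion (1)
[Figure: the input features @ 2 Hz (baseline features for Run 1, 2; new features for Run 3) are processed as sequences of length 60, 30, 20 and 10 by BLSTM_60, BLSTM_30, BLSTM_20 and BLSTM_10; a Fusion stage combines their outputs into Dynamic Music Emotion (Arousal, Valence).]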
NOTE:
• BLSTM-RNNs: 5 hidden layers (2 layers pre-trained), 250 units
• Sequence length (time-scale): 60, 30, 20, 10
• Sliding window with 50% overlap used during full-song testing
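The sliding-window testing step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `window_starts` and `predict_full_song` are hypothetical, `model` stands in for any trained fixed-length sequence model that returns one prediction per input frame, and both the extra end-aligned window and the averaging of overlapping predictions are assumptions the slides do not confirm.

```python
from typing import Callable, List, Sequence


def window_starts(n_frames: int, win: int) -> List[int]:
    """Start indices of length-`win` windows with 50% overlap covering n_frames."""
    if n_frames <= win:
        return [0]
    hop = max(1, win // 2)  # 50% overlap
    starts = list(range(0, n_frames - win + 1, hop))
    if starts[-1] != n_frames - win:
        # Assumption: add one window aligned to the end so every frame is covered.
        starts.append(n_frames - win)
    return starts


def predict_full_song(
    frames: Sequence,
    win: int,
    model: Callable[[Sequence], List[float]],
) -> List[float]:
    """Run `model` on each window and average predictions where windows overlap."""
    n = len(frames)
    sums = [0.0] * n
    counts = [0] * n
    for start in window_starts(n, win):
        chunk = frames[start:start + win]
        outputs = model(chunk)  # one prediction per frame in the chunk
        for i, y in enumerate(outputs):
            sums[start + i] += y
            counts[start + i] += 1
    return [s / c for s, c in zip(sums, counts)]
```

With sequence lengths 60, 30, 20 and 10, the same routine would be called once per time scale with the matching `win`.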
Multi-scale BLSTM-RNNs based Fusion (2)
[Figure: 5 partitions, each split into 411 clips and 20 clips]
             trial 1   trial 2   trial 3
partition 1  RMSE 11   RMSE 12   RMSE 13
partition 2  RMSE 21   RMSE 22   RMSE 23
partition 3  RMSE 31   RMSE 32   RMSE 33
partition 4  RMSE 41   RMSE 42   RMSE 43
partition 5  RMSE 51   RMSE 52   RMSE 53
5 different partitions: 20 clips are selected randomly as the validation set
3 trials of the same model: randomized initial weights
Two criteria for model selection:
1. RMSE-first: select the model with the best RMSE for each time scale
2. RMSE+PARTITION: consider both RMSE and partition
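The RMSE-first criterion can be sketched as below. This is a hedged illustration: the function names and the layout of the table (one list of per-trial RMSE values per partition, keyed by time scale) are assumptions made for the example. The RMSE+PARTITION criterion is not specified in enough detail on the slide to sketch, so it is left out.

```python
import math
from typing import Dict, List, Sequence, Tuple


def rmse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Root-mean-square error between two equal-length sequences."""
    if len(pred) != len(target) or not pred:
        raise ValueError("sequences must be non-empty and of equal length")
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))


def select_rmse_first(
    table: Dict[int, List[List[float]]],
) -> Dict[int, Tuple[int, int, float]]:
    """For each time scale, pick the (partition, trial) with the lowest RMSE.

    table[scale][p][t] is the validation RMSE of trial t on partition p
    (hypothetical layout; indices are zero-based).
    """
    best = {}
    for scale, rows in table.items():
        candidates = (
            (p, t, value)
            for p, row in enumerate(rows)
            for t, value in enumerate(row)
        )
        best[scale] = min(candidates, key=lambda item: item[2])
    return best
```

For the setup on this slide, each time scale (60, 30, 20, 10) would contribute a 5 x 3 table of validation RMSE values.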
Multi-scale BLSTM-RNNs based Fusion (3)
[Figure: fusion pipeline. Labels: BLSTM_60, BLSTM_30, BLSTM_20, BLSTM_10; GROUP 1 (RMSE-first); GROUP 2 (RMSE + PARTITION); ELM 1; ELM 2; + Delta, + Smoothing (triangle filter, length: 50); AVERAGE; output: Dynamic Music Emotion (Arousal, Valence); Run 1, Run 2, Run 3]
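Smoothing with a triangle filter of length 50 could look like the sketch below. Only the filter shape and length come from the slide; the exact weight formula, the window alignment for an even length, and the renormalisation at the sequence edges are assumptions, and the function names are hypothetical.

```python
from typing import List, Sequence


def triangle_weights(length: int) -> List[float]:
    """Symmetric triangular weights that peak at the centre and stay positive."""
    half = (length - 1) / 2.0
    return [1.0 - abs(i - half) / (half + 1.0) for i in range(length)]


def triangle_smooth(values: Sequence[float], length: int = 50) -> List[float]:
    """Smooth `values` with a triangular window of the given length.

    Assumption: near the edges, only the weights that fall inside the
    sequence are used, and they are renormalised to sum to one.
    """
    weights = triangle_weights(length)
    offset = length // 2  # window centre; off by half a sample for even lengths
    n = len(values)
    smoothed = []
    for t in range(n):
        total = 0.0
        weight_sum = 0.0
        for k, w in enumerate(weights):
            idx = t + k - offset
            if 0 <= idx < n:
                total += w * values[idx]
                weight_sum += w
        smoothed.append(total / weight_sum)
    return smoothed
```

Because the weights are renormalised, a constant prediction sequence is left unchanged, which is a quick sanity check for any smoothing step applied to arousal or valence curves.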
SVR based Hierarchical Regression
[Figure: global features feed a Global SVR that models the global trend; local features feed a Local SVR that models the local fluctuation; the SUM of the two gives Dynamic Music Emotion (Arousal, Valence). Shown for a song and for a clip (30 s).]
Global features: OpenSMILE, IS13_ComParE, 6373
Local features: OpenSMILE, IS13_ComParE_lld, 130, MEAN + STD, win: 1 s, shift: 0.5 s
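The local feature aggregation (MEAN + STD over a 1 s window shifted by 0.5 s) can be sketched as follows. The frame rate of 100 low-level-descriptor frames per second, the use of the population standard deviation, and the function name are assumptions for illustration; the slide only gives the window, the shift and the statistics.

```python
import math
from typing import List, Sequence


def window_mean_std(
    frames: Sequence[Sequence[float]],
    frames_per_sec: int = 100,  # assumption: 10 ms hop between descriptor frames
    win_s: float = 1.0,
    shift_s: float = 0.5,
) -> List[List[float]]:
    """Aggregate frame-level descriptors into per-window [means..., stds...]."""
    win = int(round(win_s * frames_per_sec))
    hop = int(round(shift_s * frames_per_sec))
    out = []
    for start in range(0, len(frames) - win + 1, hop):
        block = frames[start:start + win]
        dim = len(block[0])
        means = [sum(f[d] for f in block) / win for d in range(dim)]
        # Population standard deviation (assumption; the slide does not say which).
        stds = [
            math.sqrt(sum((f[d] - means[d]) ** 2 for f in block) / win)
            for d in range(dim)
        ]
        out.append(means + stds)
    return out
```

With 130 descriptors per frame this gives 260 values per window, one window every 0.5 s, which matches the 2 Hz rate used elsewhere in the deck.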
Run 4
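The split into a global trend and a local fluctuation, and their recombination by SUM, can be sketched as below. This is one possible reading, not the confirmed method: taking the global trend to be the mean of a clip's dynamic annotation is an assumption, and `decompose` and `combine` are hypothetical helper names. The two regressors themselves are omitted; any SVR implementation, for example `sklearn.svm.SVR`, could be trained on the two targets.

```python
from typing import List, Sequence, Tuple


def decompose(annotation: Sequence[float]) -> Tuple[float, List[float]]:
    """Split a dynamic annotation into a global trend and local fluctuations.

    Assumption: the global trend is the clip-level mean, and the local
    fluctuation is the per-frame residual around that mean.
    """
    global_trend = sum(annotation) / len(annotation)
    local = [a - global_trend for a in annotation]
    return global_trend, local


def combine(global_pred: float, local_preds: Sequence[float]) -> List[float]:
    """SUM step: add the global prediction to every local prediction."""
    return [global_pred + l for l in local_preds]


# Usage: the global target would train the Global SVR on global features,
# the local targets would train the Local SVR on local features.
trend, fluctuation = decompose([0.1, 0.3, 0.5])
reconstructed = combine(trend, fluctuation)
```

Under this reading, a perfect pair of regressors reproduces the original annotation exactly, since the decomposition is lossless.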
Conclusions:
1. Several multi-scale approaches at three levels were proposed.
2. Results illustrated the effectiveness of our new methods.
3. Multi-scale BLSTMs based Fusion with ELMs (Run 2) was almost the best.
4. SVR based Hierarchical Regression is a promising method.
Future Work:
• Select the time scale automatically and systematically
• Improve multi-scale feature learning
[Figure labels: BLSTM-AVG, BLSTM-ELM, R1 + NEW-F, SVR-HR]
Thank you for your attention!
Questions?