deep learning meetup #5
TRANSCRIPT
Deep learning for music recommendation
Aloïs Gruson
@aloisgr @nilandmusic niland.io
Who we are
• Founded in 2013 by 2 PhDs who worked at IRCAM
• Won MIREX 2011 in Music Similarity Estimation and Music Classification
• We sell our technology through our API
• A team of 9 today
What we want to do
• Create a high-dimensional space where every song is a vector
• Use this space to find similar tracks and classify songs
• Each query must run in <50 ms over millions of tracks (see the search sketch below)
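To make the latency constraint concrete, here is a minimal sketch of querying such a song space, assuming L2-normalised vectors so cosine similarity reduces to a dot product. Brute force is shown for readability; at millions of tracks and a <50 ms budget, an approximate-nearest-neighbour index would replace the full scan.

```python
# Minimal sketch (assumed setup): similarity search in the song space.
import numpy as np

def top_k(query: np.ndarray, catalogue: np.ndarray, k: int = 10) -> np.ndarray:
    """query: (D,), catalogue: (N, D); rows assumed L2-normalised."""
    scores = catalogue @ query        # cosine similarity via dot products
    return np.argsort(-scores)[:k]    # indices of the k most similar tracks
```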
How music information retrieval worked in 2011
• Short-term descriptors: MFCCs, Fluctuation Patterns ("Block-level audio features for music genre classification", Seyerlehner et al.) and much more!
• Pooling techniques: VQ, GMM-SV ("GMM Supervector for content based music similarity", Charbuillet et al.), VLAD ("Aggregating local descriptors into a compact image representation", Jégou et al.) ...
[Diagram: audio → short-term descriptors (MFCCs, FP) → pooling (VLAD, GMM-SV)]
One of our evaluation datasets
• Evaluation metrics for a search engine: precision at K (P@K) or mean average precision
• Evaluation set presented here: 8,500 tracks in 141 playlists of mainstream music
P@k          1      5      10     20     50
mirex2011    17.48  15.39  13.87  12.23  10.00
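For reference, a sketch of the P@K metric used throughout these tables: for one query, it is the fraction of the top-K results that are relevant (here, tracks sharing the query's playlist).

```python
# Sketch of the evaluation metric: precision at K for a single query.
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-K ranked results that are in the relevant set."""
    return sum(1 for i in ranked_ids[:k] if i in relevant_ids) / k
```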
From 2013 to 2014 @niland
• How to make a product from research work!
• And a lot of work on short-term descriptors and pooling techniques
• But still completely unsupervised, with no real way to match the outputs to human perception!
P@k          1       5       10      20      50
mirex2011    17.48   15.39   13.87   12.23   10.00
2014         19.70   16.81   15.37   13.57   11.01
%            +12.70  +9.23   +10.81  +10.96  +10.10
Matching algorithm outputs to human perception
• Learn the outputs of a collaborative filtering model
"Deep content-based music recommendation", van den Oord et al.
• Or use a network trained to classify into groups of similar tracks
Integrating a human notion of similarity
• 150k tracks in 3,500 theme-based albums from our clients
• Each album represents a genre, a mood or a usage
• Each gathers socially similar tracks
• We use the outputs from our previous system as inputs
• We train the network with a classification cost
• And then remove the classification layer, as sketched below!
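A minimal sketch of this recipe (PyTorch; the input and embedding sizes are assumptions, only the 3,500-album output is from the slides): train a small network to classify tracks into the theme-based albums, then keep only the hidden layer as the new embedding.

```python
# Sketch (sizes assumed): classify into theme-based albums, then strip the head.
import torch
import torch.nn as nn

IN_DIM, EMB_DIM, N_ALBUMS = 512, 256, 3500  # IN_DIM and EMB_DIM are assumptions

body = nn.Sequential(nn.Linear(IN_DIM, EMB_DIM), nn.ReLU())
head = nn.Linear(EMB_DIM, N_ALBUMS)
model = nn.Sequential(body, head)
# ... train `model` with nn.CrossEntropyLoss() on (track_vector, album_id) pairs ...

def embed(track_vector: torch.Tensor) -> torch.Tensor:
    # the classification layer is dropped; similarity search runs on these vectors
    return body(track_vector)
```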
P@k     1       5       10      20      50
2014    19.70   16.81   15.37   13.57   11.01
+deep   23.40   21.09   19.68   18.07   15.19
%       +18.78  +25.46  +28.04  +33.16  +37.97
Learning with theme-based albums
What if we want to remove the highly engineered features and pooling techniques?
Convolutional Neural Networks for Image Recognition:
Source : http://www.clarifai.com/technology
And for music?
• Mel-spectrogram (time-frequency representation) as the input: the axes have different meanings!
Should we really use square filters?
• Labels are on the whole track (>= 30 seconds): the input is 128×1200 for a 30-second song!
We have to pool along the time axis! (See the input sketch below.)
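A sketch of such an input pipeline with librosa; the sample rate, FFT size and hop length are assumptions chosen so that 30 seconds yields roughly 1,200 frames, and "track.mp3" is a hypothetical file.

```python
# Sketch (parameters assumed): a 128-band log-mel-spectrogram, ~1200 frames / 30 s.
import librosa

y, sr = librosa.load("track.mp3", sr=22050, duration=30.0)  # hypothetical file
S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                   hop_length=551, n_mels=128)
log_S = librosa.power_to_db(S)  # log compression before feeding the CNN
print(log_S.shape)              # ~ (128, 1201)
```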
And for music?
Source : Sander Dieleman, http://benanne.github.io/2014/08/05/spotify-cnns.html
And for music?
Some ideas to slightly improve it (see the CNN sketch below):
• Multi-scale pooling
• Reduce max pooling
• Add batch-norm
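A minimal sketch of a CNN along these lines, with all hyperparameters assumed: the filters span the full frequency axis (so the convolution slides along time only, sidestepping the square-filter question), pooling happens only along time, and a global temporal pooling produces the track vector.

```python
# Sketch (hyperparameters assumed): a spectrogram CNN pooling along time only.
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=(128, 4)),  # full-height filters -> (64, 1, T)
    nn.BatchNorm2d(64),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=(1, 4)),        # pool along the time axis only
    nn.Conv2d(64, 128, kernel_size=(1, 4)),
    nn.BatchNorm2d(128),
    nn.ReLU(),
)

x = torch.randn(1, 1, 128, 1200)  # one 30-second log-mel-spectrogram
h = net(x)                        # -> (1, 128, 1, T')
# global temporal pooling: concatenate mean and max over the time axis
track_vec = torch.cat([h.mean(dim=-1), h.amax(dim=-1)], dim=1).flatten(1)
```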
P@k         1      5      10     20     50
2014+deep   23.40  21.09  19.68  18.07  15.19
CNN         23.85  21.31  19.81  18.06  15.18
Okay, so?
• Our 2014 system is a mix of 6 different short-term descriptors + 6 different "smart" pooling functions: 10 years of research!
• Has the engineering problem become a data problem?
P@k         1      5      10     20     50
2014+deep   23.40  21.09  19.68  18.07  15.19
CNN         23.85  21.31  19.81  18.06  15.18
From Fisher Vectors to simple pooling functions?
• A very simple pooling function can give great results (see the sketch after the table)!
P@k            1      5      10     20     50
Mean           20.94  19.04  17.69  16.17  13.74
Max            22.21  19.90  18.58  17.07  14.61
Var            21.66  19.46  18.14  16.58  14.13
Mean+Max+Var   23.85  21.31  19.81  18.06  15.18
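A sketch of this "simple pooling" baseline, with shapes assumed: the (T, D) frame-level activations of a track are collapsed into one fixed-size vector by concatenating mean, max and variance over time.

```python
# Sketch (shapes assumed): mean + max + variance pooling along the time axis.
import numpy as np

def pool_track(frames: np.ndarray) -> np.ndarray:
    """frames: (T, D) short-term activations for one track -> (3*D,) vector."""
    return np.concatenate([frames.mean(axis=0),
                           frames.max(axis=0),
                           frames.var(axis=0)])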
And with square filters?
• Square filters also seem to work!
P@k     1      5      10     20     50
CNN     23.85  21.31  19.81  18.06  15.18
CNNsq   22.94  20.84  19.79  18.15  15.52
A transferable model for music
• It also works for world music, library music…
• The dataset here: 10k library-music tracks in 300 groups
P@k         1      5      10     20     50
2014+deep   30.66  19.99  15.57  11.81  7.93
CNN         29.76  19.82  15.55  11.85  7.80
The spectrogram is still an engineered feature…
Could we learn a better temporal filter bank to replace the FFT and mel-filtering? (A front-end sketch follows.)
"End-to-end learning for music audio", Dieleman et al.
"Learning the Speech Front-end with raw waveform CLDNNs", Sainath et al.
Source: "Learning the Speech Front-end with raw waveform CLDNNs", Sainath et al.
P@k       1      5      10     20     50
Raw       20.11  18.95  17.23  15.91  14.26
Spectro   23.85  21.31  19.81  18.06  15.18
The spectrogram is still an engineered feature…
Maybe we need more data?
We can improve!
• Add more albums!
• With 500k tracks? 1M?
P@k           1      5      10     20     50
25k tracks    19.84  17.98  15.21  14.06  13.41
150k tracks   23.85  21.31  19.81  18.06  15.18
And …
• Add more layers (a residual-block sketch follows)!
"Deep Residual Learning for Image Recognition", He et al.
P@k         1      5      10     20     50
PlainNet9   23.85  21.31  19.81  18.06  15.18
ResNet78    23.87  22.17  20.98  19.38  16.68
And?
• Data augmentation?
"Exploring data augmentation for improved singing voice detection with neural networks", Schlüter and Grill
• Recurrent neural networks?
• Siamese networks?
"An exploration of deep learning in music informatics", Humphrey et al.
• More data! Or a semi-supervised approach?
"Semi-supervised learning with ladder networks", Rasmus et al.
Questions?
@aloisgr @nilandmusic niland.io
Try it for yourself: http://demo.niland.io