deep learning meetup #5
TRANSCRIPT
Deep learning for music recommendation
Aloïs Gruson
@aloisgr @nilandmusic niland.io
Who we are
• Founded in 2013 by 2 PhDs who worked at IRCAM
• Won MIREX 2011 in Music Similarity Estimation and Music Classification
• We sell our technology through our API
• A team of 9 today
What we want to do
• Create a high-dimensional space where every song is a vector
• Use this space to find similar tracks and classify songs
• Each query must run in <50 ms over millions of tracks (see the search sketch below)
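To make the latency constraint concrete, here is a minimal sketch of querying such a song space, assuming L2-normalised vectors so cosine similarity reduces to a dot product. Brute force is shown for readability; at millions of tracks and a <50 ms budget, an approximate-nearest-neighbour index would replace the full scan.

```python
# Minimal sketch (assumed setup): similarity search in the song space.
import numpy as np

def top_k(query: np.ndarray, catalogue: np.ndarray, k: int = 10) -> np.ndarray:
    """query: (D,), catalogue: (N, D); rows assumed L2-normalised."""
    scores = catalogue @ query        # cosine similarity via dot products
    return np.argsort(-scores)[:k]    # indices of the k most similar tracks
```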
How music information retrieval worked in 2011
• Short-term descriptors: MFCCs, Fluctuation Patterns ("Block-level audio features for music genre classification", Seyerlehner et al.) and much more!
• Pooling techniques: VQ, GMM-SV ("GMM Supervector for content based music similarity", Charbuillet et al.), VLAD ("Aggregating local descriptors into a compact image representation", Jégou et al.) ...
[Diagram: audio → short-term descriptors (MFCCs, FP) → pooling (VLAD, GMM-SV)]
One of our evaluation datasets
• Evaluation metrics for a search engine: precision at K (P@K) or mean average precision
• Evaluation set presented here: 8,500 tracks in 141 playlists of mainstream music
P@k          1      5      10     20     50
mirex2011    17.48  15.39  13.87  12.23  10.00
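For reference, a sketch of the P@K metric used throughout these tables: for one query, it is the fraction of the top-K results that are relevant (here, tracks sharing the query's playlist).

```python
# Sketch of the evaluation metric: precision at K for a single query.
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-K ranked results that are in the relevant set."""
    return sum(1 for i in ranked_ids[:k] if i in relevant_ids) / k
```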
From 2013 to 2014 @niland
• How to make a product from research work!
• And a lot of work on short-term descriptors and pooling techniques
• But still completely unsupervised, with no real way to match the outputs to human perception!
P@k          1       5       10      20      50
mirex2011    17.48   15.39   13.87   12.23   10.00
2014         19.70   16.81   15.37   13.57   11.01
%            +12.70  +9.23   +10.81  +10.96  +10.10
Matching algorithm outputs to human perception
• Learn the outputs of a collaborative filtering model
"Deep content-based music recommendation", van den Oord et al.
• Or use a network trained to classify into groups of similar tracks
Integrating a human notion of similarity
• 150k tracks in 3,500 theme-based albums from our clients
• Each album represents a genre, a mood or a usage
• Each gathers socially similar tracks
• We use the outputs from our previous system as inputs
• We train the network with a classification cost
• And then remove the classification layer, as sketched below!
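A minimal sketch of this recipe (PyTorch; the input and embedding sizes are assumptions, only the 3,500-album output is from the slides): train a small network to classify tracks into the theme-based albums, then keep only the hidden layer as the new embedding.

```python
# Sketch (sizes assumed): classify into theme-based albums, then strip the head.
import torch
import torch.nn as nn

IN_DIM, EMB_DIM, N_ALBUMS = 512, 256, 3500  # IN_DIM and EMB_DIM are assumptions

body = nn.Sequential(nn.Linear(IN_DIM, EMB_DIM), nn.ReLU())
head = nn.Linear(EMB_DIM, N_ALBUMS)
model = nn.Sequential(body, head)
# ... train `model` with nn.CrossEntropyLoss() on (track_vector, album_id) pairs ...

def embed(track_vector: torch.Tensor) -> torch.Tensor:
    # the classification layer is dropped; similarity search runs on these vectors
    return body(track_vector)
```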
P@k     1       5       10      20      50
2014    19.70   16.81   15.37   13.57   11.01
+deep   23.40   21.09   19.68   18.07   15.19
%       +18.78  +25.46  +28.04  +33.16  +37.97
Learning with theme-based albums
What if we want to remove the highly engineered features and pooling techniques?
Convolutional Neural Networks for Image Recognition:
Source : http://www.clarifai.com/technology
And for music?
• Mel-spectrogram (time-frequency representation) as the input: the axes have different meanings!
Should we really use square filters?
• Labels are on the whole track (>= 30 seconds): the input is 128×1200 for a 30-second song!
We have to pool along the time axis! (See the input sketch below.)
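A sketch of such an input pipeline with librosa; the sample rate, FFT size and hop length are assumptions chosen so that 30 seconds yields roughly 1,200 frames, and "track.mp3" is a hypothetical file.

```python
# Sketch (parameters assumed): a 128-band log-mel-spectrogram, ~1200 frames / 30 s.
import librosa

y, sr = librosa.load("track.mp3", sr=22050, duration=30.0)  # hypothetical file
S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                   hop_length=551, n_mels=128)
log_S = librosa.power_to_db(S)  # log compression before feeding the CNN
print(log_S.shape)              # ~ (128, 1201)
```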
And for music?
Source : Sander Dieleman, http://benanne.github.io/2014/08/05/spotify-cnns.html
And for music?
Some ideas to slightly improve it (see the CNN sketch below):
• Multi-scale pooling
• Reduce max pooling
• Add batch-norm
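A minimal sketch of a CNN along these lines, with all hyperparameters assumed: the filters span the full frequency axis (so the convolution slides along time only, sidestepping the square-filter question), pooling happens only along time, and a global temporal pooling produces the track vector.

```python
# Sketch (hyperparameters assumed): a spectrogram CNN pooling along time only.
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=(128, 4)),  # full-height filters -> (64, 1, T)
    nn.BatchNorm2d(64),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=(1, 4)),        # pool along the time axis only
    nn.Conv2d(64, 128, kernel_size=(1, 4)),
    nn.BatchNorm2d(128),
    nn.ReLU(),
)

x = torch.randn(1, 1, 128, 1200)  # one 30-second log-mel-spectrogram
h = net(x)                        # -> (1, 128, 1, T')
# global temporal pooling: concatenate mean and max over the time axis
track_vec = torch.cat([h.mean(dim=-1), h.amax(dim=-1)], dim=1).flatten(1)
```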
P@k         1      5      10     20     50
2014+deep   23.40  21.09  19.68  18.07  15.19
CNN         23.85  21.31  19.81  18.06  15.18
Okay, so?
• Our 2014 system is a mix of 6 different short-term descriptors + 6 different "smart" pooling functions: 10 years of research!
• Has the engineering problem become a data problem?
P@k         1      5      10     20     50
2014+deep   23.40  21.09  19.68  18.07  15.19
CNN         23.85  21.31  19.81  18.06  15.18
From Fisher Vectors to simple pooling functions?
• A very simple pooling function can give great results (see the sketch after the table)!
P@k            1      5      10     20     50
Mean           20.94  19.04  17.69  16.17  13.74
Max            22.21  19.90  18.58  17.07  14.61
Var            21.66  19.46  18.14  16.58  14.13
Mean+Max+Var   23.85  21.31  19.81  18.06  15.18
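A sketch of this "simple pooling" baseline, with shapes assumed: the (T, D) frame-level activations of a track are collapsed into one fixed-size vector by concatenating mean, max and variance over time.

```python
# Sketch (shapes assumed): mean + max + variance pooling along the time axis.
import numpy as np

def pool_track(frames: np.ndarray) -> np.ndarray:
    """frames: (T, D) short-term activations for one track -> (3*D,) vector."""
    return np.concatenate([frames.mean(axis=0),
                           frames.max(axis=0),
                           frames.var(axis=0)])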
And with square filters?
• Square filters also seem to work!
P@k     1      5      10     20     50
CNN     23.85  21.31  19.81  18.06  15.18
CNNsq   22.94  20.84  19.79  18.15  15.52
A transferable model for music
• It also works for world music, library music…
• The dataset here: 10k library-music tracks in 300 groups
P@k         1      5      10     20     50
2014+deep   30.66  19.99  15.57  11.81  7.93
CNN         29.76  19.82  15.55  11.85  7.80
The spectrogram is still an engineered feature…
Could we learn a better temporal filter bank to replace the FFT and mel-filtering? (A front-end sketch follows.)
"End-to-end learning for music audio", Dieleman et al.
"Learning the Speech Front-end with raw waveform CLDNNs", Sainath et al.
Source: "Learning the Speech Front-end with raw waveform CLDNNs", Sainath et al.
P@k       1      5      10     20     50
Raw       20.11  18.95  17.23  15.91  14.26
Spectro   23.85  21.31  19.81  18.06  15.18
The spectrogram is still an engineered feature…
Maybe we need more data?
We can improve!
• Add more albums!
• With 500k tracks? 1M?
P@k           1      5      10     20     50
25k tracks    19.84  17.98  15.21  14.06  13.41
150k tracks   23.85  21.31  19.81  18.06  15.18
And …
• Add more layers (a residual-block sketch follows)!
"Deep Residual Learning for Image Recognition", He et al.
P@k         1      5      10     20     50
PlainNet9   23.85  21.31  19.81  18.06  15.18
ResNet78    23.87  22.17  20.98  19.38  16.68
And?
• Data augmentation?
"Exploring data augmentation for improved singing voice detection with neural networks", Schlüter and Grill
• Recurrent neural networks?
• Siamese networks?
"An exploration of deep learning in music informatics", Humphrey et al.
• More data! Or a semi-supervised approach?
"Semi-supervised learning with ladder networks", Rasmus et al.
Questions?
@aloisgr @nilandmusic niland.io
Try it for yourself: http://demo.niland.io