arXiv:2008.00582v3 [cs.SD] 7 Sep 2020


audioLIME: Listenable Explanations Using Source Separation

Verena Haunschmid^1, Ethan Manilow^3, and Gerhard Widmer^1,2

^1 Institute of Computational Perception, Johannes Kepler University, Linz, Austria
[email protected] [0000-0001-5466-7829]
^2 LIT Artificial Intelligence Lab, Johannes Kepler University, Linz, Austria
^3 Interactive Audio Lab, Northwestern University, Evanston, IL, USA

Abstract. Deep neural networks (DNNs) are successfully applied in a wide variety of music information retrieval (MIR) tasks but their predictions are usually not interpretable. We propose audioLIME, a method based on Local Interpretable Model-agnostic Explanations (LIME), extended by a musical definition of locality. The perturbations used in LIME are created by switching on/off components extracted by source separation, which makes our explanations listenable. We validate audioLIME on two different music tagging systems and show that it produces sensible explanations in situations where a competing method cannot.

1 Introduction

Deep neural networks (DNNs) are used in a wide variety of music information retrieval (MIR) tasks. While they generally achieve great results according to standard metrics, it is hard to interpret how or why they determine their output. This can lead to situations where a network does not learn what its designers intend. One goal of the field of interpretable machine learning is to provide tools for practitioners that push towards making the decisions of opaque models understandable. The field of MIR has many stakeholders, from individual musicians to entire corporations, all of whom must be able to trust DNN systems.

A promising approach to this problem is Local Interpretable Model-agnostic Explanations (LIME) [6], which produces explanations of predictions from an arbitrary model post-hoc by perturbing interpretable components around an input example and fitting a small surrogate model to explain the original model’s prediction. Previous attempts at adopting LIME for MIR tasks have used rectangular regions of a spectrogram for explanations [5]. This ignores two defining characteristics of audio data: 1) overlapping sounds do not occlude each other, and 2) all parts of a single sound might not be contiguous on a spectrogram.

In this work, we introduce audioLIME, an extension of LIME that preserves fundamental aspects of audio so explanations are listenable. To achieve this, we propose a new notion of “locality” based on estimates from source separation algorithms. We evaluate our method on music tagging systems by feeding the explanation back into the tagger and seeing if the prediction changes. Using this technique, we show that our method is able to explain predictions from a waveform-based music tagger, which previous methods cannot do. We also provide illustrative examples of listenable explanations from our system.


[Figure 1: pipeline diagram. Input → Source Separation → Interpretable Representation x′ ∈ {0, 1}^d′ → Perturb samples (z′_1 = {1, 1, 0, ...}, ..., z′_n = {1, 0, 0, ...}) → Compute labels ((1) map z′ to z, (2) f(z) gives the label for z′) → Train interpretable model g(z′) ≈ f(z).]

Fig. 1: audioLIME closely follows the general LIME pipeline. The key is the use of source estimates (blue box). A source separation algorithm decomposes input audio into d′ = C × τ interpretable components (C sources, τ time segments).

2 audioLIME

audioLIME is based on the LIME [6] framework and extends its definition of locality for musical data by defining a new way of deriving an interpretable representation. For an input value x to an arbitrary, black-box model f, LIME first defines a set of d′ interpretable features that can be turned on and off, x′ ∈ {0, 1}^d′. These features are perturbed and represented as a set of binary vectors z′_n that an interpretable surrogate model trains on. This surrogate model matches the behaviour of the black-box model around x and is able to reveal which of its interpretable features the black-box model relies on.
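As a concrete illustration of this surrogate step, the sketch below fits a weighted linear model on random binary perturbations; the names predict_fn and mask_to_input, the number of samples, and the proximity kernel are illustrative assumptions rather than the exact choices made by LIME or audioLIME.

import numpy as np
from sklearn.linear_model import Ridge

def fit_lime_surrogate(predict_fn, mask_to_input, d, num_samples=1000, kernel_width=0.25):
    # predict_fn: black-box model f, maps audio to a scalar score for one tag
    # mask_to_input: maps a binary vector z' in {0, 1}^d back to audio z
    # d: number of interpretable components d'
    rng = np.random.default_rng(0)

    # Random binary perturbations of the interpretable representation.
    Z = rng.integers(0, 2, size=(num_samples, d))
    Z[0] = 1  # keep the unperturbed instance (all components switched on)

    # Label every perturbed sample by querying the black-box model.
    y = np.array([predict_fn(mask_to_input(z)) for z in Z])

    # Weight samples by proximity to the original instance (fraction of active components).
    distances = 1.0 - Z.sum(axis=1) / d
    weights = np.exp(-(distances ** 2) / kernel_width ** 2)

    # Fit the interpretable surrogate g(z') ~ f(z); its coefficients rank the components.
    surrogate = Ridge(alpha=1.0)
    surrogate.fit(Z, y, sample_weight=weights)
    return surrogate.coef_

The sorted coefficients then rank the interpretable components, and the top-weighted ones form the explanation.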

The key insight of audioLIME is that interpretability with respect to audio data should really mean listenability. Whereas previous approaches applied techniques from the task of image segmentation to spectrograms, we propose using source separation estimates as interpretable representations. This gives audioLIME the ability to train on interpretable and listenable features.^4

The single-channel source separation problem is formulated as estimating a set of C sources, S_1, ..., S_C, when only given access to the mixture M of which the sources are constituents. We note that this definition, as well as audioLIME, is agnostic to the input representation (e.g., waveform, spectrogram, etc.) of the audio. We use these C estimated sources of an input audio as our interpretable components (e.g., {piano, drums, vocals, bass}). Mapping z′ ∈ {0, 1}^C to z (the input audio) is performed by mixing all present sources. For example, z′ = {0, 1, 0, 1} results in a mixture only containing estimates of drums and bass. This approach relates to the notion of locality as used in LIME because samples perturbed in this way will in general still be perceptually similar (i.e., recognized by a human as referring to the same audio piece). This system is shown in Figure 1. In addition to source separation, we also segment the audio into τ temporal segments, resulting in C × τ interpretable components.
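As a rough illustration of this mapping, assuming the C source estimates are available as equal-length NumPy arrays (the helper names below are hypothetical, not the package's API):

import numpy as np

def make_components(sources, num_segments):
    # Split each of the C source estimates into num_segments (tau) time segments,
    # yielding C * tau components; each is a full-length waveform that is zero
    # outside its own (source, segment) region.
    length = len(sources[0])
    bounds = np.linspace(0, length, num_segments + 1, dtype=int)
    components = []
    for source in sources:
        for start, end in zip(bounds[:-1], bounds[1:]):
            comp = np.zeros(length)
            comp[start:end] = source[start:end]
            components.append(comp)
    return components

def mask_to_audio(z, components):
    # Map z' in {0, 1}^(C * tau) back to audio by summing the active components.
    mix = np.zeros(len(components[0]))
    for active, comp in zip(z, components):
        if active:
            mix += comp
    return mix

# With sources ordered {piano, drums, vocals, bass} and tau = 1,
# mask_to_audio([0, 1, 0, 1], components) mixes only the drum and bass estimates.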

^4 Python package available at: https://github.com/CPJKU/audioLIME


Fig. 2: Percentage of explanations that produced the same tag as the original input using k interpretable components, for two music tagging systems. audioLIME (blue) produces better explanations than SLIME [5] (green) and the baseline (red).

3 Experiments

We analyze two music tagging models [7]: FCN [2], which takes a 29-second spectrogram as input, and SampleCNN [4], which takes a 3.69-second waveform as input. Both models were trained on the Million Song Dataset (MSD) [1]. The LIME explanation model used is a linear regression model trained with L2 regularization on 2^14 samples. We use Spleeter [3] as the source separation system.
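For reference, obtaining the stem estimates with Spleeter's Python API might look roughly like the following, mirroring the usage pattern in Spleeter's documentation; the file name and the mono down-mixing are illustrative, and details may differ between Spleeter versions.

from spleeter.separator import Separator
from spleeter.audio.adapter import AudioAdapter

# Load the pre-trained 4-stem model (vocals, drums, bass, other).
separator = Separator('spleeter:4stems')

# Load a mixture as a waveform array of shape (samples, channels).
audio_loader = AudioAdapter.default()
waveform, _ = audio_loader.load('mix.wav', sample_rate=44100)

# separate() returns a dict mapping stem names to estimated waveforms.
stems = separator.separate(waveform)
sources = [stems[name].mean(axis=1) for name in ('vocals', 'drums', 'bass', 'other')]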

Quantitative Results To verify that the explanations truly explain the model’s behaviour, we perform a simple experiment. If the explanation explains the model’s behaviour, we expect the tagger to make the same prediction when it is given only the top k selected components, and a different prediction otherwise.^5

We randomly picked 100 examples from the MSD test set, 20 for each of the 5 most common tags (rock, pop, alternative, indie, electronic). For each example we create several explanations (3 per song for FCN, 16 per song for SampleCNN) for the top predicted tag. We compare two explanation systems, using the top k components in each explanation from either audioLIME or SLIME [5]. As a baseline, we compare the prediction each tagger makes on k randomly selected components where the audioLIME surrogate models have a positive linear weight.
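A minimal sketch of this check, assuming a hypothetical tag_track function that wraps either tagger and returns its top predicted tag:

import numpy as np

def explanation_is_faithful(tag_track, components, weights, k):
    # tag_track: hypothetical wrapper around a tagger, waveform -> top predicted tag
    # components: the C * tau interpretable component waveforms for one input
    # weights: the audioLIME surrogate's linear weights for these components
    # k: number of top-weighted components to keep in the explanation
    original_tag = tag_track(np.sum(components, axis=0))

    # Keep only the k components the surrogate weights most highly.
    top_k = np.argsort(weights)[::-1][:k]
    explanation = np.sum([components[i] for i in top_k], axis=0)

    # The explanation counts as faithful if the tagger's prediction is unchanged.
    return tag_track(explanation) == original_tag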

Figure 2 shows that even when using only a fraction of the components, the tagger makes the same prediction more often with audioLIME than with SLIME or the baseline. Importantly, because audioLIME’s explanations emphasize listenability, they are invariant to the input audio representation of the model, and thus audioLIME is able to provide better explanations than SLIME, which does not have the same flexibility. This indicates that there is a whole class of waveform-based models for which SLIME is unsuited but on which audioLIME still works well.

^5 Experiment code: https://github.com/expectopatronum/mml2020-experiments/


Qualitative Results Because the explanations audioLIME makes are source estimates, it is possible to listen to and make sense of them. To illustrate this, we selected two examples of explanations of predictions made by FCN.^6 In the first example, FCN predicted the tag “female vocalist” and, indeed, the top 3 selected audioLIME components are the separated vocals with a female singer. In the second case, FCN predicted the tag “rock”, and in the top audioLIME components we can hear a driving drum set and a distorted guitar, both of which are associated with rock music. In these cases, we can be confident that our music tagging network has learned the correct concepts for these tags, which increases our trust in the black-box FCN model.

^6 https://soundcloud.com/veroamilbe/sets/mml2020-explanation-example

4 Conclusion

In this work we presented audioLIME, a system that uses source separation to produce listenable explanations. We demonstrated an experiment showing that audioLIME can produce explanations that increase trust in the predictions of music tagging systems using either waveforms or spectrograms as input. We also showed two illustrative examples of explanations from audioLIME. One shortcoming of audioLIME is its dependency on a source separation system, which only works with a limited number of source types and may introduce artifacts. However, we note that audioLIME is agnostic to the choice of source separation system, and is thus compatible with future work in that space.

5 Acknowledgements

This research is supported by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme, grant agreement No. 670035 (“Con Espressione”).

References

1. Bertin-Mahieux, T., Ellis, D.P.W., Whitman, B., Lamere, P.: The Million Song Dataset. In: Proc. of the 12th International Society for Music Information Retrieval Conference, ISMIR 2011, Miami, Florida, USA (2011)

2. Choi, K., Fazekas, G., Sandler, M.B.: Automatic tagging using deep convolutional neural networks. In: Mandel, M.I., Devaney, J., Turnbull, D., Tzanetakis, G. (eds.) Proc. of the 17th International Society for Music Information Retrieval Conference, ISMIR 2016, New York City, United States (2016)

3. Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: A Fast And State-of-the-Art Music Source Separation Tool With Pre-trained Models. Late-Breaking/Demo, ISMIR 2019 (November 2019), Deezer Research

4. Lee, J., Park, J., Kim, K.L., Nam, J.: Sample-level deep convolutional neural networks for music auto-tagging using raw waveforms. CoRR abs/1703.01789 (2017)



5. Mishra, S., Sturm, B.L., Dixon, S.: Local Interpretable Model-Agnostic Explanations for Music Content Analysis. In: Proc. of the 18th International Society for Music Information Retrieval Conference, ISMIR 2017, Suzhou, China (2017)

6. Ribeiro, M.T., Singh, S., Guestrin, C.: “Why Should I Trust You?”: Explaining the Predictions of Any Classifier. In: Proc. of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA (2016)

7. Won, M., Ferraro, A., Bogdanov, D., Serra, X.: Evaluation of CNN-based Automatic Music Tagging Models. In: Proc. of the 17th Sound and Music Computing Conference (2020)