This is the accompanying web page for the publication "Towards Musically Meaningful Explanations Using Source Separation". It provides audio examples for Section 4 "Basic Validation of audioLIME" and Section 5 "Demonstration on Real-World Data: Automated Music Tagging".

Note: In Firefox, some of the audio controls display incorrect durations (this is a known bug), but the audio plays back correctly.

In [1]:
from demo_utils import display_audio, display_markdown
from demo_utils import SAMPLE_RATE, MUSICNN_SAMPLE_LENGTH
from demo_utils import get_top_explaining_component

from IPython.display import Image

Basic Validation of audioLIME (Section 4)

Dataset Creation & "Bad" Classifier

Here we show a few examples (10-second snippets) from the dataset for the "Guitar vs. Piano" task. In the training, validation, and standard test sets, each "Guitar" example contains guitar, bass, and drums, and each "Piano" example contains piano and bass. In the bad test set, each "Guitar" example contains guitar and bass, and each "Piano" example contains piano, bass, and drums.

Next to each track name you see the prediction of the model GP$_{\text{bass}}$ (averaged over five non-overlapping 2-second windows).
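
To make these numbers concrete, here is a minimal sketch of how such a track-level prediction could be computed; predict_window is a hypothetical placeholder for the window-level output of GP$_{\text{bass}}$, and this is illustrative only, not the actual training or evaluation code.

import numpy as np

def track_prediction(y, sr, predict_window, window_sec=2, n_windows=5):
    # average the window-level outputs over five non-overlapping 2-second windows
    win = window_sec * sr
    preds = [predict_window(y[i * win:(i + 1) * win]) for i in range(n_windows)]
    return float(np.mean(preds))  # the y_hat reported next to each track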

| | Guitar (=0) | Piano (=1) |
|---|---|---|
| train | Track00005 ($\hat{y}=0.0$) | Track00008 ($\hat{y}=0.996$) |
| validation | Track01501 ($\hat{y}=0.0$) | Track01504 ($\hat{y}=0.999$) |
| test_std | Track02098 ($\hat{y}=0.0$) | Track02095 ($\hat{y}=0.991$) |
| test_bad | Track02098 ($\hat{y}=0.251$) | Track02095 ($\hat{y}=0.0$) |

Detecting the Confounding Class

When we listen to the bottom-right sample (test_bad / Track02095), we wonder why the model did not recognize it as piano, and why it is so sure that it is guitar. We take the first 61 frames (2 seconds), since the model was trained on 2-second snippets.

In [2]:
from demo_utils import create_demo_spec, predict_instrument_fn, construct_spec
from demo_utils import cerberus_models_dir

from audioLIME.factorization import CerberusFactorization
In [3]:
audio_path = 'audio/02095_pianodrums_b_snippet.wav'
S_gp = create_demo_spec(audio_path)

The factorization objects for the different models (Cerberus, OVA, Spleeter, ...) all provide the same functionality. Model-specific differences are handled in the constructor, so we pass it everything the corresponding model needs to estimate the sources. For this example we use a Cerberus model that estimates {piano, guitar, bass, drums}.
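
As a minimal sketch of the shared idea (the class and method names here are illustrative, not the actual audioLIME API): a factorization holds the estimated source signals and can mix any selected subset of them back into a single waveform; audioLIME perturbs which sources are included and feeds the resulting mixes to the classifier.

import numpy as np

class ToyStemFactorization:
    def __init__(self, stems):
        # stems: dict mapping source name -> waveform, all of equal length
        self.names = list(stems)
        self.stems = [stems[name] for name in self.names]

    def get_number_of_components(self):
        return len(self.stems)

    def compose(self, mask):
        # mask: iterable of 0/1 flags, one per component; returns the mix of the selected stems
        return np.sum([s for s, m in zip(self.stems, mask) if m], axis=0)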

In [4]:
cerberus_model_name = 'Cerberus - pno/gtr/bss/drm'
cerberus_factorization = CerberusFactorization(cerberus_models_dir, cerberus_model_name, audio_path,
                                               construct_spec,
                                               start_sample=0, y_length=MUSICNN_SAMPLE_LENGTH)

target_idx = 0 # we want an explanation for guitar
In [5]:
top_component_gp = get_top_explaining_component(x=S_gp, predict_fn=predict_instrument_fn, labels=[target_idx], factorization=cerberus_factorization)
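
For intuition, here is a simplified, conceptual sketch of what happens inside get_top_explaining_component (not the actual demo_utils/audioLIME implementation; the factorization methods follow the illustrative interface sketched above): LIME samples random on/off masks over the separated sources, queries the classifier on the re-mixed audio, fits a linear surrogate model, and the source with the largest weight is returned as the explanation.

import numpy as np
from sklearn.linear_model import Ridge

def top_component_sketch(factorization, predict_fn, target_idx, n_samples=512, seed=0):
    # predict_fn is assumed to map one audio signal to a vector of class probabilities
    rng = np.random.default_rng(seed)
    k = factorization.get_number_of_components()
    masks = rng.integers(0, 2, size=(n_samples, k))           # random component subsets
    preds = np.array([predict_fn(factorization.compose(m))[target_idx] for m in masks])
    surrogate = Ridge(alpha=1.0).fit(masks, preds)             # interpretable linear model
    best = int(np.argmax(surrogate.coef_))                     # most positively weighted source
    return factorization.compose(np.eye(k, dtype=int)[best])   # audio of the top component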

Now we listen to the explanation for "guitar":

In [6]:
display_audio(top_component_gp)

Sounds like our model learned to confound drums with guitar!

Demonstration on Real-World Data: Automated Music Tagging

In this section we demonstrate audioLIME on two real recordings from MUSDB18 and a song found on YouTube.

In [7]:
from audioLIME.factorization import FakeMusDBFactorization, SpleeterFactorization
from demo_utils import prepare_musdb_demo_example, predict_fn_MSD_musicnn_big

Qualitative comparison of MSD_musicnn_big tags and MUSDB18 genres

MUSDB18 contains annotations for 9 genres, which are not directly comparable to the tags used for training MSD_musicnn_big, but we can compare the top predicted tag with the annotated genre to get a feeling for how well the tagger works on this dataset.

In [8]:
Image("musicnn_musdb.png")
Out[8]:

We can see some obviously "correct" predictions, e.g. 47 Rock/Pop songs tagged as rock, 6 Rap songs tagged as Hip-Hop, 2 Country songs tagged as country, and so on. For some genres (Singer/Songwriter) and tags (80s, female vocalists) it is not as obvious, but the predictions can be checked by looking up the songs; e.g. the prediction '80s' for the song 'Speak Softly - Broken Man' is incorrect, since the song is from 2012. Overall, MSD_musicnn_big performs reasonably well on this dataset.
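
For reference, a rough sketch of how such a genre vs. top-tag comparison could be compiled; the track objects with a .genre attribute and the top_tag helper are hypothetical stand-ins, not part of demo_utils.

from collections import Counter

def genre_tag_counts(tracks, top_tag):
    # count how often each (annotated genre, top predicted tag) pair occurs
    # e.g. ('Rock/Pop', 'rock'): 47, ('Rap', 'Hip-Hop'): 6, ('Country', 'country'): 2, ...
    return Counter((t.genre, top_tag(t.audio)) for t in tracks)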

MSD_musicnn_big on MUSDB18 with oracle separation & spleeter:5stems

Here we demonstrate the usage of audioLIME on two examples using (a) oracle separation and (b) separation with the model spleeter:5stems.

Actions - Devil's Words

The top tag for this song is 'female vocalists'. For the demonstration we pick a snippet (since musicnn operates on 3-second snippets) whose prediction is also 'female vocalists' (starting at 25 seconds) and ask audioLIME for an explanation.

In [9]:
target_idx_female = 5
sample_start = 25*SAMPLE_RATE
y_snip_devils_words, mus_devils_words = prepare_musdb_demo_example(selected_idx=4, sample_start=sample_start)

Input Audio

Top tags: ['female vocalists', 'pop', 'electronic']
In [10]:
musdb_factorization = FakeMusDBFactorization(mus_devils_words, sample_start, MUSICNN_SAMPLE_LENGTH, SAMPLE_RATE)
spleeter_factorization = SpleeterFactorization('spleeter:5stems', mus_devils_words.path, None, sample_start, MUSICNN_SAMPLE_LENGTH)

What does the explanation for 'female vocalists' sound like?

Oracle Separation

In [11]:
top_component_md_devils = get_top_explaining_component(x=y_snip_devils_words,
                                                       predict_fn=predict_fn_MSD_musicnn_big,
                                                       labels=[target_idx_female],
                                                       factorization=musdb_factorization, batch_size=1) # musicnn only takes 1 sample at a time
display_audio(top_component_md_devils, sr=SAMPLE_RATE)

Separation using spleeter:5stems

In [12]:
top_component_spl_devils = get_top_explaining_component(x=y_snip_devils_words,
                                                        predict_fn=predict_fn_MSD_musicnn_big,
                                                        labels=[target_idx_female],
                                                        factorization=spleeter_factorization, 
                                                        batch_size=1) # musicnn only takes 1 sample at a time


display_audio(top_component_spl_devils, sr=SAMPLE_RATE)

With both separation approaches we can hear that the singing voice in the song is responsible for the prediction "female vocalists", which is great! This increases our trust that the model learned something about this tag.

ANiMAL - Easy Tiger

In [13]:
target_idx_hiphop = 33
sample_start = 47 * MUSICNN_SAMPLE_LENGTH # the 47th snippet has the highest prediction for Hip-Hop (0.817)
y_snip_easy_tiger, mus_easy_tiger = prepare_musdb_demo_example(selected_idx=2, sample_start=sample_start)
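
The snippet index in the comment above can be located from per-snippet predictions; here is a rough sketch of that selection, assuming a taggram array of shape (n_snippets, n_tags) that is not computed in this notebook.

import numpy as np
from demo_utils import MUSICNN_SAMPLE_LENGTH

def start_of_best_snippet(taggram, tag_idx, snippet_len=MUSICNN_SAMPLE_LENGTH):
    # pick the snippet with the highest prediction for the given tag
    return int(np.argmax(taggram[:, tag_idx])) * snippet_len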

Input Audio

Top tags: ['Hip-Hop', 'rock', 'jazz']
In [14]:
musdb_factorization = FakeMusDBFactorization(mus_easy_tiger, sample_start, MUSICNN_SAMPLE_LENGTH, SAMPLE_RATE)
spleeter_factorization = SpleeterFactorization('spleeter:5stems', mus_easy_tiger.path, None, sample_start, MUSICNN_SAMPLE_LENGTH)

What does the explanation for 'Hip-Hop' sound like?

Oracle Separation

In [15]:
top_component_md_tiger = get_top_explaining_component(x=y_snip_easy_tiger,
                                                       predict_fn=predict_fn_MSD_musicnn_big,
                                                       labels=[target_idx_hiphop],
                                                       factorization=musdb_factorization, batch_size=1)

display_audio(top_component_md_tiger, sr=SAMPLE_RATE)

Separation using spleeter:5stems

In [16]:
top_component_spl_tiger = get_top_explaining_component(x=y_snip_easy_tiger,
                                                        predict_fn=predict_fn_MSD_musicnn_big,
                                                        labels=[target_idx_hiphop],
                                                        factorization=spleeter_factorization, 
                                                        batch_size=1) # musicnn only takes 1 sample at a time

display_audio(top_component_spl_tiger, sr=SAMPLE_RATE)

Again, the singing voice in the song is responsible for the prediction "Hip-Hop". This does not mean, however, that singing voice in general increases the probability of the tag "Hip-Hop"; it tells us that something related to the vocal qualities indicates this tag. Since we may expect the characteristics of the vocals to be strongly related to this genre, this increases our confidence that the model has actually learned something about Hip-Hop.

Hugh Laurie - Saint James Infirmary (Let Them Talk, A Celebration of New Orleans Blues)

In this section we test audioLIME, using spleeter:5stems as our separation model, on a song found on YouTube. Unfortunately, the video owner has disabled playback on other websites; please click "Watch this video on YouTube."

The predictions and analyses are performed on 3-second snippets (because this is how the model is implemented), and each snippet is treated individually. This is different from the temporal segmentation mentioned in the paper (see Section 2.2), which would split each 3-second input snippet into the desired number of segments.
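
As a rough sketch (illustrative, not the actual analysis code), running the tagger snippet by snippet yields one prediction vector per 3-second snippet, which is what the taggram plot further below visualizes.

import numpy as np
from demo_utils import MUSICNN_SAMPLE_LENGTH

def song_taggram(y, predict_fn, snippet_len=MUSICNN_SAMPLE_LENGTH):
    # treat each consecutive 3-second snippet individually: one prediction vector per snippet
    n_snippets = len(y) // snippet_len
    return np.stack([predict_fn(y[i * snippet_len:(i + 1) * snippet_len])
                     for i in range(n_snippets)])  # shape: (n_snippets, n_tags)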

In [17]:
import IPython.display as ipd
ipd.YouTubeVideo('AzEBH6DZJVk', width=600, height=450)
Out[17]:

We show only a subset of the 50 tags: those for which the prediction exceeds 0.3 in at least one snippet.
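
A sketch of that filtering step (the taggram array and tag_names list are stand-ins for values not computed in this notebook):

import numpy as np

def tags_to_show(taggram, tag_names, threshold=0.3):
    # keep only tags whose prediction exceeds the threshold in at least one snippet
    keep = np.where(taggram.max(axis=0) > threshold)[0]
    return [tag_names[i] for i in keep]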

In [18]:
Image('img/taggram_hugh_laurie.png')
Out[18]:

We pick 3 snippets from different parts of the song that have a high prediction for jazz (starting at 21, 206, and 386 seconds). In the first snippet (starting at 21 seconds) we can only hear the piano and some noise. In the second snippet (starting at 206 seconds) we can hear several instruments, and in the third snippet (starting at 386 seconds), in addition to instruments, we also hear singing voice.

| Start (sec.) | Snippet | $p(\hat{y}=\text{'jazz'})$ | Explanation |
|---|---|---|---|
| 21 | | 0.63 | |
| 206 | | 0.95 | |
| 386 | | 0.76 | |

The prediction for the first snippet is explained by the "piano", and not the noisy part of the audio. The second snippet is also explained by "piano" and not by the other instruments. The third snippet, although it contains distinctive singing voice, is also explained by the "piano" in it.

We now have 3 very different snippets in which the probability for "jazz" is very high, and in each case the explanation is the "piano" part of the snippet. We believe that the model learned to relate the piano to the tag "jazz". This does not necessarily mean that the presence of piano per se leads to a high probability of "jazz", but that the model learned to relate specific characteristics of how the piano sounds in these songs to the tag "jazz".