This is the accompanying web page for the publication "Towards Musically Meaningful Explanations Using Source Separation". It provides audio examples for Section 4 "Basic Validation of audioLIME" and Section 5 "Demonstration on Real-World Data: Automated Music Tagging".
Note: In Firefox, some of the audio controls display an incorrect duration (a known bug), but the audio itself plays correctly.
from demo_utils import display_audio, display_markdown
from demo_utils import SAMPLE_RATE, MUSICNN_SAMPLE_LENGTH
from demo_utils import get_top_explaining_component
from IPython.display import Image
Here we show a few examples (10-second snippets) from the dataset for the "Guitar vs. Piano" task. In the training, validation, and standard test sets, each "Guitar" example contains guitar, bass, and drums, and each "Piano" example contains piano and bass. In the bad test set, each "Guitar" example contains guitar and bass, and each "Piano" example contains piano, bass, and drums.
Next to the track name you see the prediction by the model GP$_{\text{bass}}$ (averaged over five non-overlapping 2-second windows).
| | Guitar (=0) | Piano (=1) |
|---|---|---|
| train | Track00005 ($\hat{y}=0.0$) | Track00008 ($\hat{y}=0.996$) |
| validation | Track01501 ($\hat{y}=0.0$) | Track01504 ($\hat{y}=0.999$) |
| test_std | Track02098 ($\hat{y}=0.0$) | Track02095 ($\hat{y}=0.991$) |
| test_bad | Track02098 ($\hat{y}=0.251$) | Track02095 ($\hat{y}=0.0$) |
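The averaging scheme behind these scores can be sketched as follows. This is a hypothetical illustration (the real GP$_{\text{bass}}$ model is not shown); `predict_windowed` and the toy `predict_fn` are made up for the example:

```python
import numpy as np

def predict_windowed(y, window_len, predict_fn):
    # Split the input into non-overlapping windows and average the
    # per-window predictions, as the scores in the table above are averaged.
    n_windows = len(y) // window_len
    windows = [y[i * window_len:(i + 1) * window_len] for i in range(n_windows)]
    return float(np.mean([predict_fn(w) for w in windows]))

# Toy check with a made-up "classifier" on a 40-sample signal
# split into five 8-sample windows.
y = np.arange(40, dtype=float)
p = predict_windowed(y, 8, lambda w: float(w.mean() > 20))
print(p)  # 0.4: two of the five window means exceed 20
```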
When we listen to the bottom-right sample (test_bad / Track02095), we wonder why the model did not recognize it as piano, and how it can be so sure that it is a guitar. We take the first 61 frames (2 seconds), since the model was trained on 2-second snippets.
from demo_utils import create_demo_spec, predict_instrument_fn, construct_spec
from demo_utils import cerberus_models_dir
from audioLIME.factorization import CerberusFactorization
audio_path = 'audio/02095_pianodrums_b_snippet.wav'
S_gp = create_demo_spec(audio_path)
The factorization objects for the different separation models (Cerberus, OVA, Spleeter, ...) all provide the same functionality. Model-specific differences are handled in the constructor, so we pass it everything the corresponding model needs to estimate the sources. For this example we use a Cerberus model that estimates {piano, guitar, bass, drums}.
cerberus_model_name = 'Cerberus - pno/gtr/bss/drm'
cerberus_factorization = CerberusFactorization(cerberus_models_dir, cerberus_model_name, audio_path,
construct_spec,
start_sample=0, y_length=MUSICNN_SAMPLE_LENGTH)
target_idx = 0 # we want an explanation for guitar
top_component_gp = get_top_explaining_component(x=S_gp, predict_fn=predict_instrument_fn, labels=[target_idx], factorization=cerberus_factorization)
display_audio(top_component_gp)
Sounds like our model learned to confound drums with guitar!
from audioLIME.factorization import FakeMusDBFactorization, SpleeterFactorization
from demo_utils import prepare_musdb_demo_example, predict_fn_MSD_musicnn_big
MUSDB18 contains annotations for 9 genres. These are not directly comparable to the tags used for training MSD_musicnn_big, but we can compare the top predicted tag with the annotated genre to get a feeling for how well the tagger works on this dataset.
Image("musicnn_musdb.png")
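The comparison in the figure boils down to counting (annotated genre, top predicted tag) pairs. A minimal sketch, with made-up lists rather than the real MUSDB18 annotations or musicnn outputs:

```python
from collections import Counter

# Illustrative only: these lists are made up, not the real
# MUSDB18 annotations or the tagger's outputs.
genres = ["Rock/Pop", "Rock/Pop", "Rap", "Country"]
top_tags = ["rock", "rock", "Hip-Hop", "country"]

# Count how often each (annotated genre, top predicted tag) pair occurs.
pairs = Counter(zip(genres, top_tags))
print(pairs[("Rock/Pop", "rock")])  # 2
```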
We can see some obviously "correct" predictions, e.g. tagging 47 Rock/Pop songs as rock, tagging 6 Rap songs as Hip-Hop, tagging 2 Country songs as country, and so on. For some genres (Singer/Songwriter) and tags (80s, female vocalists) it is not as obvious, but the predictions can be checked by looking up the songs: e.g. the prediction '80s' for the song 'Speak Softly - Broken Man' is incorrect, as the song is from 2012. Overall, MSD_musicnn_big performs reasonably well on this dataset.
Here we demonstrate the usage of audioLIME on two examples using (a) oracle separation and (b) separation with the model spleeter:5stems.
The top tag for this song is 'female vocalists'. For the demonstration we pick a snippet (since musicnn operates on 3 second snippets) whose prediction is also 'female vocalists' (starting at 25 seconds) and ask audioLIME for an explanation.
target_idx_female = 5
sample_start = 25*SAMPLE_RATE
y_snip_devils_words, mus_devils_words = prepare_musdb_demo_example(selected_idx=4, sample_start=sample_start)
musdb_factorization = FakeMusDBFactorization(mus_devils_words, sample_start, MUSICNN_SAMPLE_LENGTH, SAMPLE_RATE)
spleeter_factorization = SpleeterFactorization('spleeter:5stems', mus_devils_words.path, None, sample_start, MUSICNN_SAMPLE_LENGTH)
top_component_md_devils = get_top_explaining_component(x=y_snip_devils_words,
predict_fn=predict_fn_MSD_musicnn_big,
labels=[target_idx_female],
factorization=musdb_factorization, batch_size=1) # musicnn only takes 1 sample at a time
display_audio(top_component_md_devils, sr=SAMPLE_RATE)
top_component_spl_devils = get_top_explaining_component(x=y_snip_devils_words,
predict_fn=predict_fn_MSD_musicnn_big,
labels=[target_idx_female],
factorization=spleeter_factorization,
batch_size=1) # musicnn only takes 1 sample at a time
display_audio(top_component_spl_devils, sr=SAMPLE_RATE)
With both separation approaches we can hear that the singing voice in the song is responsible for the prediction "female vocalists", which is great! This increases our trust that the model has learned something about this tag.
target_idx_hiphop = 33
sample_start = 47 * MUSICNN_SAMPLE_LENGTH # the 47th snippet has the highest prediction for Hip-Hop (0.817)
y_snip_easy_tiger, mus_easy_tiger = prepare_musdb_demo_example(selected_idx=2, sample_start=sample_start)
musdb_factorization = FakeMusDBFactorization(mus_easy_tiger, sample_start, MUSICNN_SAMPLE_LENGTH, SAMPLE_RATE)
spleeter_factorization = SpleeterFactorization('spleeter:5stems', mus_easy_tiger.path, None, sample_start, MUSICNN_SAMPLE_LENGTH)
top_component_md_tiger = get_top_explaining_component(x=y_snip_easy_tiger,
predict_fn=predict_fn_MSD_musicnn_big,
labels=[target_idx_hiphop],
factorization=musdb_factorization, batch_size=1)
display_audio(top_component_md_tiger, sr=SAMPLE_RATE)
top_component_spl_tiger = get_top_explaining_component(x=y_snip_easy_tiger,
predict_fn=predict_fn_MSD_musicnn_big,
labels=[target_idx_hiphop],
factorization=spleeter_factorization,
batch_size=1) # musicnn only takes 1 sample at a time
display_audio(top_component_spl_tiger, sr=SAMPLE_RATE)
Again, the singing voice in the song was responsible for the prediction "Hip-Hop". This does not mean, however, that singing voice in general will increase the probability of the tag "Hip-Hop"; it tells us that something related to the vocal qualities indicates this tag. Since we may expect the characteristics of the vocals to be strongly related to this genre, this increases our confidence that the model has actually learned something about Hip-Hop.
In this section we test audioLIME using spleeter:5stems as our separation model on a song found on YouTube. Unfortunately, the video owner has disabled playback on other websites; please click "Watch this video on YouTube."
The predictions and analyses are performed on 3-second snippets (because this is how the model is implemented), and each snippet is treated individually. This is different from the temporal segmentation mentioned in the paper (see Section 2.2), which would divide each 3-second input snippet into the desired number of segments.
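The snippet handling can be sketched as follows. The sample rate and snippet length here are assumptions for illustration, not values taken from the actual implementation:

```python
import numpy as np

# Assumed values for illustration: musicnn snippets are 3 seconds long,
# and we assume a 16 kHz sample rate here.
SAMPLE_RATE = 16000
SNIPPET_LENGTH = 3 * SAMPLE_RATE

def to_snippets(y):
    # Cut the waveform into independent, non-overlapping 3-second
    # snippets, dropping any incomplete trailing snippet.
    n = len(y) // SNIPPET_LENGTH
    return [y[i * SNIPPET_LENGTH:(i + 1) * SNIPPET_LENGTH] for i in range(n)]

song = np.zeros(10 * SAMPLE_RATE)  # a 10-second dummy "song"
snippets = to_snippets(song)
print(len(snippets))  # 3 complete snippets; the trailing second is dropped
```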
import IPython.display as ipd
ipd.YouTubeVideo('AzEBH6DZJVk', width=600, height=450)
We only show a subset of the 50 tags: those with at least one snippet whose prediction exceeds 0.3.
Image('img/taggram_hugh_laurie.png')
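This tag filter can be sketched with a small example. The tag names and values below are made up; a real taggram has shape (n_snippets, n_tags):

```python
import numpy as np

# Illustrative sketch: keep only tags whose taggram column contains at
# least one snippet with a prediction > 0.3.
tags = ["tag_0", "tag_1", "tag_2"]
taggram = np.zeros((4, 3))  # 4 dummy snippets, 3 dummy tags
taggram[0, 0] = 0.9    # tag_0 peaks well above the threshold
taggram[2, 2] = 0.31   # tag_2 just crosses it

keep = taggram.max(axis=0) > 0.3
selected_tags = [t for t, k in zip(tags, keep) if k]
print(selected_tags)  # ['tag_0', 'tag_2']
```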
We pick 3 snippets from different parts of the song that have a high prediction for jazz (starting at 21, 206, and 386 seconds). In the first snippet (starting at 21 seconds) we can only hear the piano and some noise. In the second snippet (starting at 206 seconds) we can hear several instruments, and in the third snippet (starting at 386 seconds), in addition to instruments, we also hear singing voice.
| Start (sec.) | Snippet | $p(\hat{y}=\text{'jazz'})$ | Explanation |
|---|---|---|---|
| 21 | | 0.63 | |
| 206 | | 0.95 | |
| 386 | | 0.76 | |
The prediction for the first snippet is explained by the "piano" component, not by the noisy part of the audio. The second snippet is likewise explained by "piano", not by the other instruments. The third snippet, although it contains distinctive singing voice, is also explained by the "piano" in it.
Now we have 3 very different snippets, in which the probability for "jazz" is very high and in each case the explanation is the "piano" part of the snippets. We believe that the model learned to relate the piano to the tag "jazz". This does not necessarily mean that the presence of piano leads to a high probability of "jazz" per se, but that the model learned to relate specific characteristics of how the piano sounds in those songs to the tag "jazz".