Generating mixtures with Scaper

In this section we will learn how to generate training data for music source separation using the Scaper Python library. Parts of this section are based on the Scaper tutorial.

Why Scaper?

Before we dive in, you might be wondering why we need a Python library to mix stems at all - can’t we just sum them in Python? What about data loaders provided by deep learning frameworks such as PyTorch and TensorFlow - can’t we just use those?

While there are various ways to programmatically generate mixes, we will see that Scaper is particularly well suited to this task, offering a number of benefits that make it far preferable to simple mixing via ad-hoc code:

  • Scaper supports complex, programmatic, and stochastic mixing pipelines:

    • For example, it can sample mixing parameters (such as per-stem SNR and augmentation parameters) from a variety of distributions. This allows you to generate a potentially infinite number of unique mixtures from the same set of stems.

  • Scaper supports data augmentation and normalization

    • Scaper includes data augmentation operations such as pitch shifting and time-stretching. It also provides easy-to-use options such as fix_clipping so that no matter how many stems you mix together or what SNR values you use, you can be sure your audio does not distort.

  • Scaper pipelines are reproducible

    • Scaper generates detailed annotation files - you can re-create an entire dataset of mixtures from Scaper’s annotations alone, as long as you have access to the stems (see the sketch after this list). Furthermore, a Scaper object can be initialized with a random seed, such that code run with the same seed always generates the same sequence of randomized mixtures. This allows you to share your Scaper code + stems as a “recipe” for generating data: there’s no need to share the actual (heavy) mixtures!

  • Scaper is optimized for performance

    • Scaper can generate training data on the fly for GPU training without being a bottleneck. It’s also well suited to batch generation: on a machine with 8 CPUs, for example, Scaper can generate 20,000 ten-second mixtures (mix + stems + annotations) in under 10 minutes.

  • Scaper can generate data for other audio domains

    • Scaper can generate speech/noise mixtures for training Automatic Speech Recognition (ASR), environmental soundscapes for sound event detection (SED) and classification (SEC), bioacoustic mixtures for species classification, etc. Once you know how to use Scaper for one audio domain, you know how to use it for all audio domains.

  • Scaper is documented, tested, actively maintained and updated

    • Will your ad-hoc mixing code work a few years from now? Will you remember how to use it? Will someone else know how to use it? Does it cover all corner cases? Can it be easily extended to support new features? Ad-hoc mixing code might seem like a time saver in the short term, but it’s bound to be a time sink in the long run.
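
To make the reproducibility point above concrete: Scaper can re-render a mixture from its JAMS annotation alone, via scaper.generate_from_jams. Here is a minimal sketch; mixture.jams and regenerated.wav are illustrative filenames, and it assumes the original stems still exist at the paths recorded in the annotation:

import scaper

# Re-render a mixture from a saved JAMS annotation. The stems referenced
# inside the annotation must still exist on disk at their recorded paths.
scaper.generate_from_jams('mixture.jams', 'regenerated.wav')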

Scaper overview

Scaper can be viewed as a programmatic audio mixer. At a high level, the input to Scaper is:

  1. source material: audio recordings you want to mix together (“soundbank” in the diagram below).

  2. event specification: a “recipe” for how to mix the recordings.

Scaper takes these and generates mixtures using the source material by following the event specification. Since the event specification can be probabilistic, multiple different mixtures can be generated from the same source material and event specification. For each generated mixture Scaper outputs:

  1. The mixture audio signal (“soundscape” in the diagram).

  2. The mixture annotation in JAMS format (detailed) and in a simplified tabular format (Python list or CSV).

  3. The audio of each processed stem (or sound event) used to create the mixture.

Fig. 47 Block diagram of automatic mixing via Scaper. (Image: scaper_diagram.png)

Read more

You can learn more about Scaper by reading the Scaper paper: Scaper: A library for soundscape synthesis and augmentation

@inproceedings{Salamon:Scaper:WASPAA:17,
  author       = {Salamon, J. and MacConnell, D. and Cartwright, M. and Li, P. and Bello, J.~P.},
  title        = {Scaper: A Library for Soundscape Synthesis and Augmentation},
  booktitle    = {IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)},
  month        = {Oct.},
  year         = {2017},
  pages        = {344--348}
}

Please cite this paper if you use Scaper in your work. You do not need to read the paper to complete this tutorial.

Installing Scaper, nussl, and other dependencies

We can easily install Scaper and nussl via:

  • pip install scaper

  • pip install nussl

We’ll also install the common utility library we have created for this tutorial so we can easily visualize our data:

%%capture
!pip install scaper
!pip install nussl
!pip install git+https://github.com/source-separation/tutorial

To keep the tutorial page clean, we’ll hide Python warnings:

# To keep things clean we'll hide all warnings
import warnings
warnings.filterwarnings('ignore')

Preparing the source material

Download MUSDB18 training clips

We’ll use the stems of the MUSDB18 dataset as our source material. Let’s start by downloading the data - as explained in The MUSDB18 dataset, we’ll be using 7-second preview clips rather than the full dataset.

As we did earlier, let’s download the train set data via nussl:

(don’t worry if you get a warning from SoX, everything will still work as expected)

import nussl

musdb_train = nussl.datasets.MUSDB18(download=True, subsets=['train'])

# Confirm there are 94 training clips
len(musdb_train.items)
94

Prepare the source material for Scaper

Scaper expects the source material to be grouped into two categories: foreground files and background files. When Scaper is used to generate soundscapes (e.g., for sound event detection), this distinction is used to separate material used for background tracks and material used for foreground sound events (for further details about this distinction see this section of the Scaper tutorial).

Since we’ll be using Scaper to generate music mixtures, however, we do not require this foreground/background distinction and so we will place all of our stems in the foreground folder.

For music source separation, Scaper expects the following file structure:

foreground/
|--- drums/
     |--- stem.wav
|--- bass/
     |--- stem.wav
|--- vocals/
     |--- stem.wav
|--- other/
     |--- stem.wav
     
background/
(empty)

That is, inside the foreground folder there should be a folder per stem type (drums, bass, vocals, other), and within each stem (label) folder we store the audio file for the corresponding stem.

Note

Generally in Scaper the name of the stem file does not matter, as long as it’s an audio file with a .wav extension. For music source separation, however, it is helpful to choose a meaningful filename that we can use to link stems from the same song. A good convention (that we will use) is to give all stems from the same song the same filename (remember, the stems go in different folders), for example song_name.wav.

To make this step easy and quick, we have provided code below to arrange the MUSDB18 training clips in the folder structure expected by Scaper. This code will:

  • Iterate over each item (track) in the train set

  • For each track, iterate over its sources (stems)

  • For each stem, create a folder for the stem if one doesn’t exist already (drums, bass, vocals, other)

  • For each stem, place the corresponding audio file in the stem folder and use the song name as the filename

from pathlib import Path

# create foreground folder
fg_folder = Path('~/.nussl/ismir2020-tutorial/foreground').expanduser()  
fg_folder.mkdir(parents=True, exist_ok=True)                             

# create background folder - we need to provide one even if we don't use it
bg_folder = Path('~/.nussl/ismir2020-tutorial/background').expanduser()
bg_folder.mkdir(parents=True, exist_ok=True)
# For each item (track) in the train set, iterate over its sources (stems),
# create a folder for the stem if it doesn't exist already (drums, bass, vocals, other) 
# and place the stem audio file in this folder, using the song name as the filename
for item in musdb_train:
    song_name = item['mix'].file_name
    for key, val in item['sources'].items():
        src_path = fg_folder / key 
        src_path.mkdir(exist_ok=True)
        src_path = str(src_path / song_name) + '.wav'
        val.write_audio_to_file(src_path)

Now we have a folder called foreground, inside of which there are four stem folders: bass, drums, vocals, other, and inside each of these folders we have the audio files for all matching stems. I.e., in the bass folder we will have the bass stems from all the songs in the dataset, in the drums folder we’ll have the drum stems from all songs, etc. We’ve renamed each stem file to the name of the song it belongs to. Let’s verify this:

import os
import glob

for folder in os.listdir(fg_folder):
    if folder[0] != '.':  # ignore system folders
        stem_files = os.listdir(os.path.join(fg_folder, folder))
        print(f"\n{folder}\tfolder contains {len(stem_files)} audio files:\n")
        for sf in sorted(stem_files)[:5]:
            print(f"\t\t{sf}")
        print("\t\t...")
drums	folder contains 94 audio files:

		A Classic Education - NightOwl.wav
		ANiMAL - Clinic A.wav
		ANiMAL - Easy Tiger.wav
		ANiMAL - Rockshow.wav
		Actions - Devil's Words.wav
		...

vocals	folder contains 94 audio files:

		A Classic Education - NightOwl.wav
		ANiMAL - Clinic A.wav
		ANiMAL - Easy Tiger.wav
		ANiMAL - Rockshow.wav
		Actions - Devil's Words.wav
		...

other	folder contains 94 audio files:

		A Classic Education - NightOwl.wav
		ANiMAL - Clinic A.wav
		ANiMAL - Easy Tiger.wav
		ANiMAL - Rockshow.wav
		Actions - Devil's Words.wav
		...

bass	folder contains 94 audio files:

		A Classic Education - NightOwl.wav
		ANiMAL - Clinic A.wav
		ANiMAL - Easy Tiger.wav
		ANiMAL - Rockshow.wav
		Actions - Devil's Words.wav
		...

Note

The name of each stem audio file matches the name of the song to which it belongs. We will use this later to create “coherent mixtures”, i.e., music mixtures where all the stems come from the same song and are temporally aligned.

Defining a Scaper object

After organizing our source material, the next step is to create a Scaper object.

Upon creating the object, we will define:

  • duration: the duration in seconds of the mixtures this Scaper will generate (fixed once defined)

  • fg_path, bg_path: the paths to the foreground and background folders with the source material we have prepared and organized

  • random_state: a seed for initializing this Scaper object’s random state.

Let’s create a Scaper that generates 5-second mixtures:

import scaper

seed = 123 # integer or np.random.RandomState(<integer>)

sc = scaper.Scaper(
    duration=5.0,
    fg_path=str(fg_folder),
    bg_path=str(bg_folder),
    random_state=seed
)

Note

We will use this Scaper to generate a randomized sequence of mixtures. By seeding the Scaper object’s random state, this sequence will be exactly the same each time we run our code. This guarantees that our data generation pipeline is fully reproducible (by others and by ourselves). Changing the seed will result in a different sequence of random mixtures.
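
To see the effect of seeding in isolation, here is a quick illustrative check. It assumes Scaper exposes its internal NumPy RandomState via the random_state attribute (true in recent versions):

# Two Scaper objects created with the same seed draw identical random
# numbers, so identical event specifications will yield identical mixtures.
sc_a = scaper.Scaper(5.0, str(fg_folder), str(bg_folder), random_state=123)
sc_b = scaper.Scaper(5.0, str(fg_folder), str(bg_folder), random_state=123)
assert sc_a.random_state.rand() == sc_b.random_state.rand()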

Next let’s set some key parameters:

  • sr: the sample rate for the output audio

  • n_channels: the number of channels for the output audio (1 = mono)

  • ref_db: a reference loudness in decibels (dB) for generating the mixture.

sc.sr = 44100
sc.n_channels = 1
sc.ref_db = -20

Note

When we add stems to the mixture later on, we will choose a Signal-to-Noise Ratio (SNR) in dB relative to the reference loudness ref_db. For example, if ref_db is -20, and we mix a stem with an SNR value of 10, then it will have a dB value of -10. A high ref_db value will produce loud mixtures, and a low value will produce soft mixtures.

Note

All absolute loudness values in Scaper are measured in LUFS (Loudness units relative to full scale) - a standard loudness measurement unit used for audio normalization in broadcast television systems and other video and music streaming services. For relative loudness, a difference of 1 LUFS is equal to 1 dB. For convenience, we’ll use dB to refer both to absolute and relative loudness in this section.

Adding events (stems)

Next we need to add stems to our mixture. In Scaper we do this by adding “events”, using the add_event function.

For each event that we add we specify the following:

  • label: the type of event (in our case drums, bass, vocals or other)

  • source_file: which audio file to use from all files matching the provided label

  • source_time: time offset for sampling the stem audio file, i.e., where to start in the source audio

  • event_time: offset for the start time of the event in the generated mixture

  • event_duration: how long the event should last in the mixture

  • snr: the event’s signal-to-noise ratio relative to ref_db

  • pitch_shift: whether to apply a pitch shifting augmentation to the event and if so by how much

  • time_stretch: whether to apply a time stretching augmentation to the event and if so by how much

Probabilistic event parameters

If we set each event parameter to a constant value (which we could in principle), the Scaper object would always generate the same mixture, which isn’t very helpful. Instead, for each parameter we can specify a distribution to sample from, such that each time we call sc.generate(...) later on, the value of the parameter will be different.

By setting event parameters to distributions, we obtain a probabilistic event specification. When we call sc.generate(...), a mixture will be “instantiated” by sampling a value for every parameter from the distribution we have specified for it.

In Scaper, distributions are defined using “distribution tuples”:

  • ('const', value): a constant, given by value.

  • ('choose', list): uniformly sample from a finite set of values given by list.

  • ('uniform', min, max): sample from a uniform distribution between min and max.

  • ('normal', mean, std): sample from a normal distribution with mean mean and standard deviation std.

  • ('truncnorm', mean, std, min, max): sample from a truncated normal distribution with mean mean and standard deviation std, limited to values between min and max.

This is one of Scaper’s key features - the ability to add stochastic (i.e., randomized) events. The same event specification can generate infinitely many different mixtures, because the event parameters are sampled from distributions.
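
To build intuition for how these tuples behave, here is a toy sampler that mirrors their semantics. This is for illustration only - it is not Scaper’s actual implementation (the truncnorm branch, for instance, uses naive rejection sampling):

import numpy as np

def sample(dist_tuple, rng):
    # Toy re-implementation of Scaper's distribution-tuple semantics
    kind = dist_tuple[0]
    if kind == 'const':
        return dist_tuple[1]
    if kind == 'choose':
        return rng.choice(dist_tuple[1])
    if kind == 'uniform':
        return rng.uniform(dist_tuple[1], dist_tuple[2])
    if kind == 'normal':
        return rng.normal(dist_tuple[1], dist_tuple[2])
    if kind == 'truncnorm':
        mean, std, lo, hi = dist_tuple[1:]
        x = rng.normal(mean, std)
        while not lo <= x <= hi:  # naive rejection sampling
            x = rng.normal(mean, std)
        return x
    raise ValueError(f"Unknown distribution tuple: {dist_tuple}")

rng = np.random.default_rng(0)
print(sample(('uniform', -5, 5), rng))             # e.g. an SNR value
print(sample(('choose', ['drums', 'bass']), rng))  # e.g. a source label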

Let’s add one event per stem type in a loop:

labels = ['vocals', 'drums', 'bass', 'other']

for label in labels:
    sc.add_event(label=('const', label),                # set the label value explicitly using a constant
                 source_file=('choose', []),            # choose the source file randomly from all files in the folder
                 source_time=('uniform', 0, 7),         # sample the source (stem) audio starting at a time between 0-7
                 event_time=('const', 0),               # always add the stem at time 0 in the mixture
                 event_duration=('const', sc.duration), # set the stem duration to match the mixture duration
                 snr=('uniform', -5, 5),                # choose an SNR for the stem uniformly between -5 and 5 dB
                 pitch_shift=('uniform', -2, 2),        # apply a random pitch shift between -2 and 2 semitones
                 time_stretch=('uniform', 0.8, 1.2))    # apply a random time stretch between 0.8 (faster) and 1.2 (slower)

Tip

It is very important to choose sensible ranges for your parameter distributions. In the example above we have set the snr range to [-5, 5], meaning sources will be at most 10 dB louder than each other. Similarly, we have limited the range of pitch shifts to [-2, 2] semitones. Extreme parameter values can lead to unrealistic mixtures.

Generating data

Now that we have added events to our Scaper object, we can call sc.generate(): this will “instantiate” the specification (i.e., sample concrete values for every parameter) and use them to generate a mixture. Each call to sc.generate() will create a different instantiation of the events’ parameters and thus generate a different mixture.

mixture_audio, mixture_jam, annotation_list, stem_audio_list = sc.generate()
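
As a quick sanity check on loudness, we can measure the generated mixture with pyloudnorm (a library Scaper itself uses for loudness measurement, so it should already be installed - an assumption worth verifying in your environment):

import pyloudnorm as pyln

# Each stem is normalized to roughly ref_db + snr LUFS, so a mixture of
# four stems will typically come out somewhat louder than ref_db itself.
meter = pyln.Meter(sc.sr)
print(f"Mixture loudness: {meter.integrated_loudness(mixture_audio):.2f} LUFS")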

Note

When we call generate(), Scaper will raise warnings if the sampled event parameters cannot be satisfied (e.g., we request an event_duration of 100 seconds but the source_file is only 10 seconds long), and let us know how it adjusted the values so that they can be satisfied. To keep things clean we have disabled these warnings in this section. Once you complete this section, you may want to re-run this notebook without disabling warnings to examine the warnings that Scaper issues.

Let’s understand the output from generate():

  • mixture_audio: the audio data of the generated mixture, as a NumPy ndarray

  • mixture_jam: the mixture annotation, in JAMS format

  • annotation_list: a simplified mixture annotation in list format (we will not use this)

  • stem_audio_list: a Python list containing the audio of the individual stems (events) in the mixture
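
By default generate() only returns these objects in memory. If you also want them written to disk, generate() accepts output paths directly (the filenames below are illustrative):

# Write the mixture, its JAMS annotation, and each processed stem to disk
# in a single call.
sc.generate(audio_path='mixture.wav',
            jams_path='mixture.jams',
            save_isolated_events=True,
            fix_clipping=True)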

Let’s inspect the JAMS annotation, where we can find the values Scaper has sampled for the parameters of the four events we have added. The complete set of parameters is specified in the “value” field of the annotated event.

# extract the annotation data from the JAMS object
ann = mixture_jam.annotations.search(namespace='scaper')[0]

# print the sampled parameters for each event in the annotation
for event in ann:
    print(f"\n{event.value['label']}:\n")
    print(event.value)
vocals:

OrderedDict([('label', 'vocals'), ('source_file', '/home/runner/.nussl/ismir2020-tutorial/foreground/vocals/Port St Willow - Stay Even.wav'), ('source_time', 0.9942668075895806), ('event_time', 0), ('event_duration', 5.0), ('snr', 2.1946896978556305), ('role', 'foreground'), ('pitch_shift', -0.30757415950215616), ('time_stretch', 0.8907405814256812)])

drums:

OrderedDict([('label', 'drums'), ('source_file', '/home/runner/.nussl/ismir2020-tutorial/foreground/drums/Traffic Experiment - Sirens.wav'), ('source_time', 1.0454477739410935), ('event_time', 0), ('event_duration', 5.0), ('snr', -3.600492373696155), ('role', 'foreground'), ('pitch_shift', -0.39592977335518587), ('time_stretch', 0.9643697490836076)])

bass:

OrderedDict([('label', 'bass'), ('source_file', '/home/runner/.nussl/ismir2020-tutorial/foreground/bass/Dreamers Of The Ghetto - Heavy Love.wav'), ('source_time', 0.10762590642068458), ('event_time', 0), ('event_duration', 5.0), ('snr', -1.0195574466956856), ('role', 'foreground'), ('pitch_shift', 0.9519816229281428), ('time_stretch', 0.9754288978718497)])

other:

OrderedDict([('label', 'other'), ('source_file', '/home/runner/.nussl/ismir2020-tutorial/foreground/other/Music Delta - Beatles.wav'), ('source_time', 0.15099625329503377), ('event_time', 0), ('event_duration', 5.0), ('snr', 2.1233018138876245), ('role', 'foreground'), ('pitch_shift', -0.288546036648722), ('time_stretch', 0.9761028712256306)])

Note

In Jupyter notebooks, JAMS annotations can be visualized interactively, but this cannot be displayed in the static book format. We encourage you to run this section as a Jupyter notebook to explore the annotation data interactively.

# Launch this cell in a Jupyter notebook to generate an interactive visualization of the annotation!
ann
(Interactive JAMS annotation display - omitted in the static book. The annotation’s sandbox records the full generation configuration, e.g. duration 5.0, the fg/bg paths, fix_clipping False, ref_db -20 and scaper_version 1.6.4, alongside the per-event parameters printed above.)

sc.generate() provides a large set of configuration parameters that impact the generated audio. In this tutorial we’ll discuss two important parameters:

  • fix_clipping: when True, ensures the mixture does not clip by peak normalizing the mixture and adjusting the loudness of its constituent stems accordingly.

  • quick_pitch_time: when True, uses faster but lower-quality algorithms for pitch shifting and time stretching, which can dramatically speed up data generation. In this section we’ll leave this parameter set to False, but later in this tutorial we will use it to maximize training speed.
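
For example, a call combining both parameters might look like this (for illustration only - in this section we keep quick_pitch_time at its default of False):

# Clipping-safe generation with fast, lower-quality pitch/time augmentation.
mixture_audio, mixture_jam, annotation_list, stem_audio_list = sc.generate(
    fix_clipping=True, quick_pitch_time=True)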

Tip

It is important to set fix_clipping=True when calling sc.generate(); otherwise our mixtures can distort.

Let’s generate another mixture using the same Scaper object:

mixture_audio, mixture_jam, annotation_list, stem_audio_list = sc.generate(fix_clipping=True)

We can inspect the annotation to see that the sampled mixture parameters are all different:

ann = mixture_jam.annotations.search(namespace='scaper')[0]

for event in ann:
    print(f"\n{event.value['label']}:\n")
    print(event.value)
vocals:

OrderedDict([('label', 'vocals'), ('source_file', '/home/runner/.nussl/ismir2020-tutorial/foreground/vocals/Night Panther - Fire.wav'), ('source_time', 1.7507239493214413), ('event_time', 0), ('event_duration', 4.386839518664663), ('snr', 1.1102351067758285), ('role', 'foreground'), ('pitch_shift', 0.8897735302808862), ('time_stretch', 1.1397727176311159)])

drums:

OrderedDict([('label', 'drums'), ('source_file', '/home/runner/.nussl/ismir2020-tutorial/foreground/drums/Traffic Experiment - Once More (With Feeling).wav'), ('source_time', 0.4116605732032902), ('event_time', 0), ('event_duration', 5.0), ('snr', -2.0628595361117066), ('role', 'foreground'), ('pitch_shift', 0.523904495417951), ('time_stretch', 0.9447154622489257)])

bass:

OrderedDict([('label', 'bass'), ('source_file', '/home/runner/.nussl/ismir2020-tutorial/foreground/bass/Skelpolu - Human Mistakes.wav'), ('source_time', 0.7770380338578964), ('event_time', 0), ('event_duration', 5.0), ('snr', -0.06314902349693785), ('role', 'foreground'), ('pitch_shift', -0.2966788388166881), ('time_stretch', 0.9734804690718113)])

other:

OrderedDict([('label', 'other'), ('source_file', '/home/runner/.nussl/ismir2020-tutorial/foreground/other/Bill Chudziak - Children Of No-one.wav'), ('source_time', 1.611179748776235), ('event_time', 0), ('event_duration', 5.0), ('snr', 4.441600182038796), ('role', 'foreground'), ('pitch_shift', 0.007346703537346233), ('time_stretch', 0.9705405227851233)])
# Launch this cell in a Jupyter notebook to generate an interactive visualization of the annotation!
ann
(Interactive JAMS annotation display - omitted in the static book. The sandbox matches the previous one except that fix_clipping is now True: peak_normalization_scale_factor is ~0.796, ref_db_change is ~-1.98, and ref_db_generated is ~-21.98, showing that Scaper scaled the mixture down slightly to avoid clipping.)

Incoherent mixing

So far we’ve only inspected the annotation… let’s listen to our generated audio!

from IPython.display import Audio, display

display(Audio(data=mixture_audio.T, rate=sc.sr))

…Wait! WHAT WAS THAT? CHAOS! CACOPHONY!

That’s because we just generated an incoherent mixture, i.e., a mixture where the stems are not necessarily from the same song, and even if they are, they are not necessarily temporally aligned:

Fig. 48 Incoherent mixing vs. coherent mixing. (Image: incoherent_vs_coherent_mixing.png)

We can verify this by listening to the individual stems:

# extract the annotation data from the JAMS object
ann = mixture_jam.annotations.search(namespace='scaper')[0]

# iterate over the annotation and corresponding stem audio data
for obs, stem_audio in zip(ann.data, stem_audio_list):
    print(f"Instrument: {obs.value['label']} at SNR: {obs.value['snr']:.2f}")
    display(Audio(data=stem_audio.T, rate=sc.sr))
Instrument: vocals at SNR: 1.11
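
Looking ahead, here is a rough sketch of how a coherent event specification could look: pin every stem to the same song and the same offset, and disable the augmentations that would break alignment. The variable names song and offset are illustrative, and later parts of this tutorial develop coherent mixing properly:

import os
import random

song = random.choice(sorted(os.listdir(fg_folder / 'vocals')))  # one song for all stems
offset = random.uniform(0, 2)  # clips are 7 s and mixtures 5 s, so offset <= 2

sc_coherent = scaper.Scaper(5.0, str(fg_folder), str(bg_folder), random_state=seed)
sc_coherent.ref_db = -20

for label in ['vocals', 'drums', 'bass', 'other']:
    sc_coherent.add_event(label=('const', label),
                          source_file=('const', str(fg_folder / label / song)),  # same song
                          source_time=('const', offset),  # same offset keeps stems aligned
                          event_time=('const', 0),
                          event_duration=('const', sc_coherent.duration),
                          snr=('uniform', -5, 5),
                          pitch_shift=None,   # no pitch shift, to preserve coherence
                          time_stretch=None)  # no time stretch, to preserve alignment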