What is Source Separation?
Source separation is the process of isolating individual sounds in an auditory mixture of multiple sounds. [VVG18,CFL+18,RLStoter+18] We call each sound heard in a mixture a source. For example, we might want to isolate a singer from the background music to make a karaoke version of a song, or isolate the bass guitar from the rest of the band so a musician can learn the part. Put another way, given a mixture of multiple sources, how can we recover only the source signals we’re interested in?
Mathematically, we assume that a mixture signal \(y(t)\) is composed of \(N\) sources, \(x_n(t)\), for \(n=1,\dots,N\), such that
\[y(t) = \sum_{n=1}^{N} x_n(t).\]
Given only \(y(t)\), the goal of a source separation system is to recover one or more of the sources \(x_n(t)\).
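As a rough illustration of this signal model, the sketch below builds a mixture as the sample-wise sum of a few synthetic sources. The sine waves, the sample rate, and the names `sine_source` and `mix` are assumptions made for this example only; real sources are recorded instruments or voices, not pure tones.

```python
import math

SAMPLE_RATE = 8000  # samples per second (assumed for illustration)
NUM_SAMPLES = 80    # 10 ms of audio at the assumed sample rate


def sine_source(freq_hz, amplitude, num_samples=NUM_SAMPLES, sample_rate=SAMPLE_RATE):
    """Return samples of a sine wave, standing in for one source x_n(t)."""
    return [
        amplitude * math.sin(2.0 * math.pi * freq_hz * t / sample_rate)
        for t in range(num_samples)
    ]


def mix(sources):
    """Sum the sources sample by sample: y(t) = x_1(t) + ... + x_N(t)."""
    return [sum(samples) for samples in zip(*sources)]


# Three hypothetical sources (N = 3) with arbitrary frequencies and levels.
sources = [
    sine_source(220.0, 0.5),
    sine_source(440.0, 0.3),
    sine_source(660.0, 0.2),
]

# The mixture is all that a separation system gets to observe.
mixture = mix(sources)
```

A separation system is handed only `mixture` and has to estimate one or more entries of `sources` from it.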
It is often the case that there are more sources within the mixture than there are mixture signals. Because of this, source separation is considered an underdetermined problem, meaning that there are fewer observations (i.e., the mixture channels) than there are required outcomes (i.e., the desired source(s)). For example, if a stereo mixture contains a recording of a piano quartet (e.g., a piano, violin, viola, and cello), we have only two observations (the two channels of the stereo mix) from which to recover any of the four sources. Source separation is a useful tool for isolating one of those sources (e.g., just the piano).
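One way to see why an underdetermined problem is hard is to notice that different sets of sources can produce exactly the same mixture. The sketch below does this for a single time instant of a stereo mix of four sources. The panning gains in `GAINS`, the source values, and the helper `mix_stereo` are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
# Hypothetical panning gains: rows are the left/right channels, columns are
# the four sources (piano, violin, viola, cello). These values are assumptions.
GAINS = [
    [1.0, 0.5, 0.5, 0.0],  # left channel
    [0.0, 0.5, 0.5, 1.0],  # right channel
]


def mix_stereo(source_values, gains=GAINS):
    """Return the two channel values produced by four source values at one instant."""
    return [
        sum(g * x for g, x in zip(channel_gains, source_values))
        for channel_gains in gains
    ]


# One candidate set of source values at a single time instant.
sources_a = [0.25, 0.5, -0.25, 0.125]

# A direction the mixing cannot see: it contributes zero to both channels.
invisible = [1.0, -2.0, 0.0, 1.0]

# A second, different candidate obtained by moving along that direction.
sources_b = [x + 0.5 * d for x, d in zip(sources_a, invisible)]

# Both candidates yield the same two observations.
mixture_a = mix_stereo(sources_a)
mixture_b = mix_stereo(sources_b)
```

Because `mixture_a` and `mixture_b` are identical, the two channels alone cannot tell the candidates apart. A separation system therefore has to rely on additional knowledge about what the sources sound like.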
In this tutorial we will be focusing on music separation, or the process of isolating at least one musical instrument or singer from a musical mixture that contains one or more other musical instruments or singers. Music is seen as a distinct problem from other types of source separation because there are many factors that make it uniquely challenging [CFL+18]:
Sources in music are highly correlated, meaning that the sources tend to change at the same time. For example, in a rock band, if the bass guitar changes its note at the start of a new measure, it is likely that the other instruments will change as well.
Music is mixed and processed in ways that are aphysical and non-linear. Contemporary recording practices are such that any given source in a mixture might never occur naturally in the real world. Reverb, filtering, and non-linear signal processing techniques all make music separation difficult, and yet these are tools that recording engineers and musicians routinely use to create music. This is a problem because we rarely, if ever, know what processing was applied to any source or to the whole mixture.
If the result of a music source separation system is used in an end-user system, the bar for quality is much higher. As we will see, there are now many systems for musicians and sound engineers that incorporate source separation. In these systems, source separation is not an intermediate step for another auditory process; it is the end goal, and its output is listened to by users who may expect high-quality results. Therefore it is paramount that the results of the system sound good enough for those users. [1]
What specific sounds constitute a source is highly dependent on the application and desired output. Some approaches make explicit assumptions about what a source actually is, whereas others do not. For instance, a dog barking might be background noise in one scenario, and therefore ignored, but for a sound event detection system it might be a source of interest. Source separation research has largely assumed that sources and source types are known a priori.
Why Use Source Separation?
There are many reasons to study source separation. One might be interested in using existing methods to enhance a downstream task, or one might be interested in source separation as a pursuit in itself.
There are many demonstrated uses for music source separation within the field of Music Information Retrieval (MIR). In many scenarios, researchers have discovered that it is easier to process isolated sources than mixtures of those sources. For example, source separation has been used to enhance:
lyric and music alignment [FGO+06],
musical instrument detection [HKV09],
lyric recognition [MV10],
vocal activity detection [SED18a], and
fundamental frequency estimation [JBEW19].
Additionally, source separation has long been seen as an inherently worthwhile endeavor on its own merits, with many thousands of research papers appearing over the past few decades and more appearing every year.
Whether you plan to create new source separation research or use existing methods to advance your own work, we hope this tutorial will provide you with a solid foundation for understanding this field.
In the next section we will provide a brief overview of the open source landscape before diving into the basics of source separation.
[1] Some creative applications might not have such strict demands; when using source separation to create remixes, for instance, the artifacts might be masked by other sources in the mix.