References
- BSF94
Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.
- Bro91
Judith C Brown. Calculation of a constant Q spectral transform. The Journal of the Acoustical Society of America, 89(1):425–434, 1991.
- BP92
Judith C Brown and Miller S Puckette. An efficient algorithm for the calculation of a constant Q transform. The Journal of the Acoustical Society of America, 92(5):2698–2701, 1992.
- CFL+18
Estefania Cano, Derry FitzGerald, Antoine Liutkus, Mark D Plumbley, and Fabian-Robert Stöter. Musical source separation: an introduction. IEEE Signal Processing Magazine, 36(1):31–40, 2018.
- CPMH16
Mark Cartwright, Bryan Pardo, Gautham J Mysore, and Matt Hoffman. Fast and easy crowdsourced perceptual audio evaluation. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 619–623. IEEE, 2016.
- CLM17
Zhuo Chen, Yi Luo, and Nima Mesgarani. Deep attractor network for single-microphone speaker separation. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 246–250. IEEE, 2017.
- CKH+18
Hyeong-Seok Choi, Jang-Hyun Kim, Jaesung Huh, Adrian Kim, Jung-Woo Ha, and Kyogu Lee. Phase-aware speech enhancement with deep complex U-Net. In International Conference on Learning Representations. 2018.
- DefossezUBB19a
Alexandre Défossez, Nicolas Usunier, Léon Bottou, and Francis Bach. Demucs: deep extractor for music sources with extra unlabeled data remixed. arXiv preprint arXiv:1909.01174, 2019.
- DefossezUBB19b
Alexandre Défossez, Nicolas Usunier, Léon Bottou, and Francis Bach. Music source separation in the waveform domain. arXiv preprint arXiv:1911.13254, 2019.
- EAC+18
Jesse Engel, Kumar Krishna Agrawal, Shuo Chen, Ishaan Gulrajani, Chris Donahue, and Adam Roberts. GANSynth: adversarial neural audio synthesis. In International Conference on Learning Representations. 2018.
- EGR+19
Jesse Engel, Chenjie Gu, Adam Roberts, and others. DDSP: differentiable digital signal processing. In International Conference on Learning Representations. 2019.
- FBR12
Benoit Fuentes, Roland Badeau, and Gaël Richard. Blind harmonic adaptive decomposition applied to supervised source separation. In 2012 Proceedings of the 20th European Signal Processing Conference (EUSIPCO), 2654–2658. IEEE, 2012.
- FGO+06
Hiromasa Fujihara, Masataka Goto, Jun Ogata, Kazunori Komatani, Tetsuya Ogata, and Hiroshi G Okuno. Automatic synchronization between lyrics and music CD recordings based on Viterbi alignment of segregated vocal signals. In Eighth IEEE International Symposium on Multimedia (ISM'06), 257–264. IEEE, 2006.
- GSD12
Joachim Ganseman, Paul Scheunders, and Simon Dixon. Improving PLCA-based score-informed source separation with invertible constant-Q transforms. In 2012 Proceedings of the 20th European Signal Processing Conference (EUSIPCO), 2634–2638. IEEE, 2012.
- GL84
Daniel Griffin and Jae Lim. Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243, 1984.
- GS10
David Gunawan and Deep Sen. Iterative phase estimation for the synthesis of separated sources from single-channel mixtures. IEEE Signal Processing Letters, 17(5):421–424, 2010.
- HMW20a
Verena Haunschmid, Ethan Manilow, and Gerhard Widmer. Towards musically meaningful explanations using source separation. arXiv preprint arXiv:2009.02051, 2020.
- HMW20b
Verena Haunschmid, Ethan Manilow, and Gerhard Widmer. AudioLIME: listenable explanations using source separation. In 13th International Workshop on Machine Learning and Music, page 20, 2020.
- HKV09
Toni Heittola, Anssi Klapuri, and Tuomas Virtanen. Musical instrument recognition in polyphonic audio using source-filter model for sound separation. In ISMIR, 327–332. 2009.
- HKVM20
Romain Hennequin, Anis Khlif, Felix Voituret, and Manuel Moussallam. Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software, 5(50):2154, 2020. Deezer Research. doi:10.21105/joss.02154.
- HCLRW16
John R Hershey, Zhuo Chen, Jonathan Le Roux, and Shinji Watanabe. Deep clustering: discriminative embeddings for segmentation and separation. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 31–35. IEEE, 2016.
- HL15
Ying Hu and Guizhong Liu. Separation of singing voice using nonnegative matrix partial co-factorization for singer identification. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(4):643–653, 2015.
- HL20
Yun-Ning Hung and Alexander Lerch. Multitask learning for instrument activation aware music source separation. arXiv preprint arXiv:2008.00616, 2020.
- IS15
Sergey Ioffe and Christian Szegedy. Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
- JFB+11
Rajesh Jaiswal, Derry FitzGerald, Dan Barry, Eugene Coyle, and Scott Rickard. Clustering NMF basis functions using shifted NMF for monaural sound source separation. In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 245–248. IEEE, 2011.
- JBEW19
Andreas Jansson, Rachel M Bittner, Sebastian Ewert, and Tillman Weyde. Joint singing voice separation and F0 estimation with deep U-Net architectures. In 2019 27th European Signal Processing Conference (EUSIPCO), 1–5. IEEE, 2019.
- JHM+17
Andreas Jansson, Eric Humphrey, Nicola Montecchio, Rachel Bittner, Aparna Kumar, and Tillman Weyde. Singing voice separation with deep U-Net convolutional networks. In International Society for Music Information Retrieval Conference (ISMIR), 2017.
- KMHGomez20
Venkatesh S Kadandale, Juan F Montesinos, Gloria Haro, and Emilia Gómez. Multi-task U-Net for music source separation. arXiv preprint arXiv:2003.10414, 2020.
- LRWW+19
Jonathan Le Roux, Gordon Wichern, Shinji Watanabe, Andy Sarroff, and John R Hershey. Phasebook and friends: leveraging discrete representations for source separation. IEEE Journal of Selected Topics in Signal Processing, 13(2):370–382, 2019.
- LRWEH19
Jonathan Le Roux, Scott Wisdom, Hakan Erdogan, and John R Hershey. SDR: half-baked or well done? In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 626–630. IEEE, 2019.
- LluisPS19
Francesc Lluís, Jordi Pons, and Xavier Serra. End-to-end music source separation: is it possible in the waveform domain? Proc. Interspeech 2019, pages 4619–4623, 2019.
- LSC+18
Vincent Lostanlen, Justin Salamon, Mark Cartwright, Brian McFee, Andrew Farnsworth, Steve Kelling, and Juan Pablo Bello. Per-channel energy normalization: why and how. IEEE Signal Processing Letters, 26(1):39–43, 2018.
- LCH+17
Yi Luo, Zhuo Chen, John R Hershey, Jonathan Le Roux, and Nima Mesgarani. Deep clustering and conventional networks for music separation: stronger together. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 61–65. IEEE, 2017.
- LM18
Yi Luo and Nima Mesgarani. TasNet: time-domain audio separation network for real-time, single-channel speech separation. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 696–700. IEEE, 2018.
- LM19
Yi Luo and Nima Mesgarani. Conv-TasNet: surpassing ideal time–frequency magnitude masking for speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(8):1256–1266, 2019.
- MSP18
Ethan Manilow, Prem Seetharaman, and Bryan Pardo. The Northwestern University source separation library. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR 2018), Paris, France, September 23–27, 2018.
- MSP20
Ethan Manilow, Prem Seetharaman, and Bryan Pardo. Simultaneous separation and transcription of mixtures with multiple polyphonic and percussive instruments. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 771–775. IEEE, 2020.
- MYK+19
Yoshiki Masuyama, Kohei Yatabe, Yuma Koizumi, Yasuhiro Oikawa, and Noboru Harada. Deep Griffin-Lim iteration. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 61–65. IEEE, 2019.
- MV10
Annamaria Mesaros and Tuomas Virtanen. Automatic recognition of lyrics in singing. EURASIP Journal on Audio, Speech, and Music Processing, 2010(1):546047, 2010.
- MBP19
Gabriel Meseguer-Brocal and Geoffroy Peeters. Conditioned-U-Net: introducing a control mechanism in the U-Net for multiple source separations. arXiv preprint arXiv:1907.01277, 2019.
- Mik12
Tomáš Mikolov. Statistical language models based on neural networks. PhD thesis, Brno University of Technology, 2012.
- ODZ+16
Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: a generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
- PBSondergaard13
Nathanaël Perraudin, Peter Balazs, and Peter L Søndergaard. A fast Griffin-Lim algorithm. In 2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 1–4. IEEE, 2013.
- PCC+20
Darius Petermann, Pritish Chandna, Helena Cuesta, Jordi Bonada, and Emilia Gómez. Deep learning based source separation applied to choir ensembles. arXiv preprint arXiv:2008.07645, 2020.
- PP18
Fatemeh Pishdadian and Bryan Pardo. Multi-resolution common fate transform. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(2):342–354, 2018.
- PAB+02
Mark D Plumbley, Samer A Abdallah, Juan Pablo Bello, Mike E Davies, Giuliano Monti, and Mark B Sandler. Automatic music transcription and audio source separation. Cybernetics & Systems, 33(6):603–627, 2002.
- RLStoter+17
Zafar Rafii, Antoine Liutkus, Fabian-Robert Stöter, Stylianos Ioannis Mimilakis, and Rachel Bittner. The MUSDB18 corpus for music separation. December 2017. doi:10.5281/zenodo.1117372.
- RLStoter+18
Zafar Rafii, Antoine Liutkus, Fabian-Robert Stöter, Stylianos Ioannis Mimilakis, Derry FitzGerald, and Bryan Pardo. An overview of lead and accompaniment separation in music. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(8):1307–1335, 2018.
- RLS+19
Zafar Rafii, Antoine Liutkus, Fabian-Robert Stöter, Stylianos Ioannis Mimilakis, and Rachel Bittner. MUSDB18-HQ: an uncompressed version of MUSDB18. August 2019. doi:10.5281/zenodo.3338373.
- RP11
Zafar Rafii and Bryan Pardo. Degenerate unmixing estimation technique using the constant Q transform. In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 217–220. IEEE, 2011.
- RP12a
Zafar Rafii and Bryan Pardo. Music/voice separation using the similarity matrix. In ISMIR. 2012.
- RP12b
Zafar Rafii and Bryan Pardo. Repeating pattern extraction technique (REPET): a simple method for music/voice separation. IEEE Transactions on Audio, Speech, and Language Processing, 21(1):73–84, 2012.
- SBStoter+18
Michael Schoeffler, Sarah Bartoschek, Fabian-Robert Stöter, Marlene Roess, Susanne Westphal, Bernd Edler, and Jürgen Herre. webMUSHRA: a comprehensive framework for web-based listening tests. Journal of Open Research Software, 2018.
- SPP17
Prem Seetharaman, Fatemeh Pishdadian, and Bryan Pardo. Music/voice separation using the 2D Fourier transform. In 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 36–40. IEEE, 2017.
- SWPR20
Prem Seetharaman, Gordon Wichern, Bryan Pardo, and Jonathan Le Roux. AutoClip: adaptive gradient clipping for source separation networks. arXiv preprint arXiv:2007.14469, 2020.
- SDL19
Bidisha Sharma, Rohan Kumar Das, and Haizhou Li. On the importance of audio-source separation for singer identification in polyphonic music. In INTERSPEECH, 2020–2024. 2019.
- SLL+19
Ziqiang Shi, Huibin Lin, Liu Liu, Rujie Liu, and Jiqing Han. Is CQT more suitable for monaural speech separation than STFT? An empirical study. arXiv preprint arXiv:1902.00631, 2019.
- SHGomez20
Olga Slizovskaia, Gloria Haro, and Emilia Gómez. Conditioned source separation for music instrument performances. arXiv preprint arXiv:2004.03873, 2020.
- SI11
Julius O Smith III. Spectral audio signal processing. W3K Publishing, 2011.
- SED18a
Daniel Stoller, Sebastian Ewert, and Simon Dixon. Jointly detecting and separating singing voice: a multi-task approach. In International Conference on Latent Variable Analysis and Signal Separation, 329–339. Springer, Cham, 2018.
- SED18b
Daniel Stoller, Sebastian Ewert, and Simon Dixon. Wave-U-Net: a multi-scale neural network for end-to-end audio source separation. arXiv preprint arXiv:1806.03185, 2018.
- StoterLB+16
Fabian-Robert Stöter, Antoine Liutkus, Roland Badeau, Bernd Edler, and Paul Magron. Common fate model for unison source separation. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 126–130. IEEE, 2016.
- TM20
Naoya Takahashi and Yuki Mitsufuji. D3Net: densely connected multidilated DenseNet for music source separation. arXiv preprint arXiv:2010.01733, 2020.
- UVL16
Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instance normalization: the missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.
- VGFevotte06
Emmanuel Vincent, Rémi Gribonval, and Cédric Févotte. Performance measurement in blind audio source separation. IEEE Transactions on Audio, Speech, and Language Processing, 14(4):1462–1469, 2006.
- VVG18
Emmanuel Vincent, Tuomas Virtanen, and Sharon Gannot. Audio source separation and speech enhancement. John Wiley & Sons, 2018.
- WLRH18
Zhong-Qiu Wang, Jonathan Le Roux, and John R Hershey. Alternative objective functions for deep clustering. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 686–690. IEEE, 2018.
- WHLRS14
Felix Weninger, John R Hershey, Jonathan Le Roux, and Björn Schuller. Discriminatively trained recurrent neural networks for single-channel speech separation. In 2014 IEEE Global Conference on Signal and Information Processing (GlobalSIP), 577–581. IEEE, 2014.
- WWollmerS11
Felix Weninger, Martin Wöllmer, and Björn Schuller. Automatic assessment of singer traits in popular music: gender, age, height and race. In Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR 2011), Miami, FL, USA. 2011.
- WLR18
Gordon Wichern and Jonathan Le Roux. Phase reconstruction with learned time-frequency representations for single-channel speech separation. In 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), 396–400. IEEE, 2018.
- WH18
Yuxin Wu and Kaiming He. Group normalization. In Proceedings of the European conference on computer vision (ECCV), 3–19. 2018.
- ZHSJ19
Jingzhao Zhang, Tianxing He, Suvrit Sra, and Ali Jadbabaie. Why gradient clipping accelerates training: a theoretical justification for adaptivity. arXiv preprint arXiv:1905.11881, 2019.
- DumoulinVisin16
Vincent Dumoulin and Francesco Visin. A guide to convolution arithmetic for deep learning. arXiv preprint arXiv:1603.07285, 2016.