References

BSF94

Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.

Bro91

Judith C Brown. Calculation of a constant Q spectral transform. The Journal of the Acoustical Society of America, 89(1):425–434, 1991.

BP92

Judith C Brown and Miller S Puckette. An efficient algorithm for the calculation of a constant Q transform. The Journal of the Acoustical Society of America, 92(5):2698–2701, 1992.

CFL+18

Estefania Cano, Derry FitzGerald, Antoine Liutkus, Mark D Plumbley, and Fabian-Robert Stöter. Musical source separation: an introduction. IEEE Signal Processing Magazine, 36(1):31–40, 2018.

CPMH16

Mark Cartwright, Bryan Pardo, Gautham J Mysore, and Matt Hoffman. Fast and easy crowdsourced perceptual audio evaluation. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 619–623. IEEE, 2016.

CLM17

Zhuo Chen, Yi Luo, and Nima Mesgarani. Deep attractor network for single-microphone speaker separation. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 246–250. IEEE, 2017.

CKH+18

Hyeong-Seok Choi, Jang-Hyun Kim, Jaesung Huh, Adrian Kim, Jung-Woo Ha, and Kyogu Lee. Phase-aware speech enhancement with deep complex U-Net. In International Conference on Learning Representations. 2018.

DefossezUBB19a

Alexandre Défossez, Nicolas Usunier, Léon Bottou, and Francis Bach. Demucs: deep extractor for music sources with extra unlabeled data remixed. arXiv preprint arXiv:1909.01174, 2019.

DefossezUBB19b

Alexandre Défossez, Nicolas Usunier, Léon Bottou, and Francis Bach. Music source separation in the waveform domain. arXiv preprint arXiv:1911.13254, 2019.

EAC+18

Jesse Engel, Kumar Krishna Agrawal, Shuo Chen, Ishaan Gulrajani, Chris Donahue, and Adam Roberts. GANSynth: adversarial neural audio synthesis. In International Conference on Learning Representations. 2018.

EGR+19

Jesse Engel, Chenjie Gu, Adam Roberts, and others. DDSP: differentiable digital signal processing. In International Conference on Learning Representations. 2019.

FBR12

Benoit Fuentes, Roland Badeau, and Gaël Richard. Blind harmonic adaptive decomposition applied to supervised source separation. In 2012 Proceedings of the 20th European Signal Processing Conference (EUSIPCO), 2654–2658. IEEE, 2012.

FGO+06

Hiromasa Fujihara, Masataka Goto, Jun Ogata, Kazunori Komatani, Tetsuya Ogata, and Hiroshi G Okuno. Automatic synchronization between lyrics and music CD recordings based on Viterbi alignment of segregated vocal signals. In Eighth IEEE International Symposium on Multimedia (ISM'06), 257–264. IEEE, 2006.

GSD12

Joachim Ganseman, Paul Scheunders, and Simon Dixon. Improving PLCA-based score-informed source separation with invertible constant-Q transforms. In 2012 Proceedings of the 20th European Signal Processing Conference (EUSIPCO), 2634–2638. IEEE, 2012.

GL84

Daniel Griffin and Jae Lim. Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243, 1984.

GS10

David Gunawan and Deep Sen. Iterative phase estimation for the synthesis of separated sources from single-channel mixtures. IEEE Signal Processing Letters, 17(5):421–424, 2010.

HMW20a

Verena Haunschmid, Ethan Manilow, and Gerhard Widmer. Towards musically meaningful explanations using source separation. arXiv preprint arXiv:2009.02051, 2020.

HMW20b

Verena Haunschmid, Ethan Manilow, and Gerhard Widmer. audioLIME: listenable explanations using source separation. In 13th International Workshop on Machine Learning and Music, page 20, 2020.

HKV09

Toni Heittola, Anssi Klapuri, and Tuomas Virtanen. Musical instrument recognition in polyphonic audio using source-filter model for sound separation. In ISMIR, 327–332. 2009.

HKVM20

Romain Hennequin, Anis Khlif, Felix Voituret, and Manuel Moussallam. Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software, 5(50):2154, 2020. Deezer Research. URL: https://doi.org/10.21105/joss.02154.

HCLRW16

John R Hershey, Zhuo Chen, Jonathan Le Roux, and Shinji Watanabe. Deep clustering: discriminative embeddings for segmentation and separation. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 31–35. IEEE, 2016.

HL15

Ying Hu and Guizhong Liu. Separation of singing voice using nonnegative matrix partial co-factorization for singer identification. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(4):643–653, 2015.

HL20

Yun-Ning Hung and Alexander Lerch. Multitask learning for instrument activation aware music source separation. arXiv preprint arXiv:2008.00616, 2020.

IS15

Sergey Ioffe and Christian Szegedy. Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

JFB+11

Rajesh Jaiswal, Derry FitzGerald, Dan Barry, Eugene Coyle, and Scott Rickard. Clustering NMF basis functions using shifted NMF for monaural sound source separation. In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 245–248. IEEE, 2011.

JBEW19

Andreas Jansson, Rachel M Bittner, Sebastian Ewert, and Tillman Weyde. Joint singing voice separation and F0 estimation with deep U-Net architectures. In 2019 27th European Signal Processing Conference (EUSIPCO), 1–5. IEEE, 2019.

JHM+17

Andreas Jansson, Eric Humphrey, Nicola Montecchio, Rachel Bittner, Aparna Kumar, and Tillman Weyde. Singing voice separation with deep U-Net convolutional networks. In International Society for Music Information Retrieval Conference (ISMIR), 2017.

KMHGomez20

Venkatesh S Kadandale, Juan F Montesinos, Gloria Haro, and Emilia Gómez. Multi-task U-Net for music source separation. arXiv preprint arXiv:2003.10414, 2020.

LRWW+19

Jonathan Le Roux, Gordon Wichern, Shinji Watanabe, Andy Sarroff, and John R Hershey. Phasebook and friends: leveraging discrete representations for source separation. IEEE Journal of Selected Topics in Signal Processing, 13(2):370–382, 2019.

LRWEH19

Jonathan Le Roux, Scott Wisdom, Hakan Erdogan, and John R Hershey. SDR - half-baked or well done? In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 626–630. IEEE, 2019.

LluisPS19

Francesc Lluís, Jordi Pons, and Xavier Serra. End-to-end music source separation: is it possible in the waveform domain? In Proc. Interspeech 2019, 4619–4623. 2019.

LSC+18

Vincent Lostanlen, Justin Salamon, Mark Cartwright, Brian McFee, Andrew Farnsworth, Steve Kelling, and Juan Pablo Bello. Per-channel energy normalization: why and how. IEEE Signal Processing Letters, 26(1):39–43, 2018.

LCH+17

Yi Luo, Zhuo Chen, John R Hershey, Jonathan Le Roux, and Nima Mesgarani. Deep clustering and conventional networks for music separation: stronger together. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 61–65. IEEE, 2017.

LM18

Yi Luo and Nima Mesgarani. TasNet: time-domain audio separation network for real-time, single-channel speech separation. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 696–700. IEEE, 2018.

LM19

Yi Luo and Nima Mesgarani. Conv-TasNet: surpassing ideal time–frequency magnitude masking for speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(8):1256–1266, 2019.

MSP18

Ethan Manilow, Prem Seetharaman, and Bryan Pardo. "the northwestern university source separation library". In "Proceedings of the 19th International Society of Music Information Retrieval Conference (ISMIR 2018), Paris, France, September 23-27". 2018.

MSP20

Ethan Manilow, Prem Seetharaman, and Bryan Pardo. Simultaneous separation and transcription of mixtures with multiple polyphonic and percussive instruments. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 771–775. IEEE, 2020.

MYK+19

Yoshiki Masuyama, Kohei Yatabe, Yuma Koizumi, Yasuhiro Oikawa, and Noboru Harada. Deep Griffin-Lim iteration. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 61–65. IEEE, 2019.

MV10

Annamaria Mesaros and Tuomas Virtanen. Automatic recognition of lyrics in singing. EURASIP Journal on Audio, Speech, and Music Processing, 2010(1):546047, 2010.

MBP19

Gabriel Meseguer-Brocal and Geoffroy Peeters. Conditioned-U-Net: introducing a control mechanism in the U-Net for multiple source separations. arXiv preprint arXiv:1907.01277, 2019.

Mik12

Tomáš Mikolov. Statistical language models based on neural networks. PhD thesis, Brno University of Technology, 2012.

ODZ+16

Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: a generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.

PBSondergaard13

Nathanaël Perraudin, Peter Balazs, and Peter L Søndergaard. A fast Griffin-Lim algorithm. In 2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 1–4. IEEE, 2013.

PCC+20

Darius Petermann, Pritish Chandna, Helena Cuesta, Jordi Bonada, and Emilia Gómez. Deep learning based source separation applied to choir ensembles. arXiv preprint arXiv:2008.07645, 2020.

PP18

Fatemeh Pishdadian and Bryan Pardo. Multi-resolution common fate transform. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(2):342–354, 2018.

PAB+02

Mark D Plumbley, Samer A Abdallah, Juan Pablo Bello, Mike E Davies, Giuliano Monti, and Mark B Sandler. Automatic music transcription and audio source separation. Cybernetics & Systems, 33(6):603–627, 2002.

RLStoter+17

Zafar Rafii, Antoine Liutkus, Fabian-Robert Stöter, Stylianos Ioannis Mimilakis, and Rachel Bittner. The MUSDB18 corpus for music separation. December 2017. URL: https://doi.org/10.5281/zenodo.1117372.

RLStoter+18

Zafar Rafii, Antoine Liutkus, Fabian-Robert Stöter, Stylianos Ioannis Mimilakis, Derry FitzGerald, and Bryan Pardo. An overview of lead and accompaniment separation in music. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(8):1307–1335, 2018.

RLS+19

Zafar Rafii, Antoine Liutkus, Fabian-Robert Stöter, Stylianos Ioannis Mimilakis, and Rachel Bittner. MUSDB18-HQ - an uncompressed version of MUSDB18. August 2019. URL: https://doi.org/10.5281/zenodo.3338373.

RP11

Zafar Rafii and Bryan Pardo. Degenerate unmixing estimation technique using the constant Q transform. In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 217–220. IEEE, 2011.

RP12a

Zafar Rafii and Bryan Pardo. Music/voice separation using the similarity matrix. In ISMIR. 2012.

RP12b

Zafar Rafii and Bryan Pardo. Repeating pattern extraction technique (REPET): a simple method for music/voice separation. IEEE Transactions on Audio, Speech, and Language Processing, 21(1):73–84, 2012.

SBStoter+18

Michael Schoeffler, Sarah Bartoschek, Fabian-Robert Stöter, Marlene Roess, Susanne Westphal, Bernd Edler, and Jürgen Herre. webMUSHRA - a comprehensive framework for web-based listening tests. Journal of Open Research Software, 2018.

SPP17

Prem Seetharaman, Fatemeh Pishdadian, and Bryan Pardo. Music/voice separation using the 2D Fourier transform. In 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 36–40. IEEE, 2017.

SWPR20

Prem Seetharaman, Gordon Wichern, Bryan Pardo, and Jonathan Le Roux. AutoClip: adaptive gradient clipping for source separation networks. arXiv preprint arXiv:2007.14469, 2020.

SDL19

Bidisha Sharma, Rohan Kumar Das, and Haizhou Li. On the importance of audio-source separation for singer identification in polyphonic music. In INTERSPEECH, 2020–2024. 2019.

SLL+19

Ziqiang Shi, Huibin Lin, Liu Liu, Rujie Liu, and Jiqing Han. Is CQT more suitable for monaural speech separation than STFT? An empirical study. arXiv preprint arXiv:1902.00631, 2019.

SHGomez20

Olga Slizovskaia, Gloria Haro, and Emilia Gómez. Conditioned source separation for music instrument performances. arXiv preprint arXiv:2004.03873, 2020.

SI11

Julius O Smith III. Spectral audio signal processing. W3K Publishing, 2011.

SED18a

Daniel Stoller, Sebastian Ewert, and Simon Dixon. Jointly detecting and separating singing voice: a multi-task approach. In International Conference on Latent Variable Analysis and Signal Separation, 329–339. Springer, Cham, 2018.

SED18b

Daniel Stoller, Sebastian Ewert, and Simon Dixon. Wave-U-Net: a multi-scale neural network for end-to-end audio source separation. arXiv preprint arXiv:1806.03185, 2018.

StoterLB+16

Fabian-Robert Stöter, Antoine Liutkus, Roland Badeau, Bernd Edler, and Paul Magron. Common fate model for unison source separation. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 126–130. IEEE, 2016.

TM20

Naoya Takahashi and Yuki Mitsufuji. D3Net: densely connected multidilated DenseNet for music source separation. arXiv preprint arXiv:2010.01733, 2020.

UVL16

Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instance normalization: the missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.

VGFevotte06

Emmanuel Vincent, Rémi Gribonval, and Cédric Févotte. Performance measurement in blind audio source separation. IEEE Transactions on Audio, Speech, and Language Processing, 14(4):1462–1469, 2006.

VVG18

Emmanuel Vincent, Tuomas Virtanen, and Sharon Gannot. Audio source separation and speech enhancement. John Wiley & Sons, 2018.

WLRH18

Zhong-Qiu Wang, Jonathan Le Roux, and John R Hershey. Alternative objective functions for deep clustering. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 686–690. IEEE, 2018.

WHLRS14

Felix Weninger, John R Hershey, Jonathan Le Roux, and Björn Schuller. Discriminatively trained recurrent neural networks for single-channel speech separation. In 2014 IEEE Global Conference on Signal and Information Processing (GlobalSIP), 577–581. IEEE, 2014.

WWollmerS11

Felix Weninger, Martin Wöllmer, and Björn Schuller. Automatic assessment of singer traits in popular music: gender, age, height and race. In Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR 2011), Miami, FL, USA. 2011.

WLR18

Gordon Wichern and Jonathan Le Roux. Phase reconstruction with learned time-frequency representations for single-channel speech separation. In 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), 396–400. IEEE, 2018.

WH18

Yuxin Wu and Kaiming He. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), 3–19. 2018.

ZHSJ19

Jingzhao Zhang, Tianxing He, Suvrit Sra, and Ali Jadbabaie. Why gradient clipping accelerates training: a theoretical justification for adaptivity. arXiv preprint arXiv:1905.11881, 2019.

DumoulinVisin16

Vincent Dumoulin and Francesco Visin. A guide to convolution arithmetic for deep learning. arXiv preprint arXiv:1603.07285, 2016.