References
- BSF94
Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.
- Bro91
Judith C Brown. Calculation of a constant Q spectral transform. The Journal of the Acoustical Society of America, 89(1):425–434, 1991.
- BP92
Judith C Brown and Miller S Puckette. An efficient algorithm for the calculation of a constant Q transform. The Journal of the Acoustical Society of America, 92(5):2698–2701, 1992.
- CFL+18
Estefania Cano, Derry FitzGerald, Antoine Liutkus, Mark D Plumbley, and Fabian-Robert Stöter. Musical source separation: an introduction. IEEE Signal Processing Magazine, 36(1):31–40, 2018.
- CPMH16
Mark Cartwright, Bryan Pardo, Gautham J Mysore, and Matt Hoffman. Fast and easy crowdsourced perceptual audio evaluation. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 619–623. IEEE, 2016.
- CLM17
Zhuo Chen, Yi Luo, and Nima Mesgarani. Deep attractor network for single-microphone speaker separation. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 246–250. IEEE, 2017.
- CKH+18
Hyeong-Seok Choi, Jang-Hyun Kim, Jaesung Huh, Adrian Kim, Jung-Woo Ha, and Kyogu Lee. Phase-aware speech enhancement with deep complex U-Net. In International Conference on Learning Representations. 2018.
- DefossezUBB19a
Alexandre Défossez, Nicolas Usunier, Léon Bottou, and Francis Bach. Demucs: deep extractor for music sources with extra unlabeled data remixed. arXiv preprint arXiv:1909.01174, 2019.
- DefossezUBB19b
Alexandre Défossez, Nicolas Usunier, Léon Bottou, and Francis Bach. Music source separation in the waveform domain. arXiv preprint arXiv:1911.13254, 2019.
- EAC+18
Jesse Engel, Kumar Krishna Agrawal, Shuo Chen, Ishaan Gulrajani, Chris Donahue, and Adam Roberts. GANSynth: adversarial neural audio synthesis. In International Conference on Learning Representations. 2018.
- EGR+19
Jesse Engel, Chenjie Gu, Adam Roberts, and others. DDSP: differentiable digital signal processing. In International Conference on Learning Representations. 2019.
- FBR12
Benoit Fuentes, Roland Badeau, and Gaël Richard. Blind harmonic adaptive decomposition applied to supervised source separation. In 2012 Proceedings of the 20th European Signal Processing Conference (EUSIPCO), 2654–2658. IEEE, 2012.
- FGO+06
Hiromasa Fujihara, Masataka Goto, Jun Ogata, Kazunori Komatani, Tetsuya Ogata, and Hiroshi G Okuno. Automatic synchronization between lyrics and music CD recordings based on Viterbi alignment of segregated vocal signals. In Eighth IEEE International Symposium on Multimedia (ISM'06), 257–264. IEEE, 2006.
- GSD12
Joachim Ganseman, Paul Scheunders, and Simon Dixon. Improving PLCA-based score-informed source separation with invertible constant-Q transforms. In 2012 Proceedings of the 20th European Signal Processing Conference (EUSIPCO), 2634–2638. IEEE, 2012.
- GL84
Daniel Griffin and Jae Lim. Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243, 1984.
- GS10
David Gunawan and Deep Sen. Iterative phase estimation for the synthesis of separated sources from single-channel mixtures. IEEE Signal Processing Letters, 17(5):421–424, 2010.
- HMW20a
Verena Haunschmid, Ethan Manilow, and Gerhard Widmer. Towards musically meaningful explanations using source separation. arXiv preprint arXiv:2009.02051, 2020.
- HMW20b
Verena Haunschmid, Ethan Manilow, and Gerhard Widmer. AudioLIME: listenable explanations using source separation. In 13th International Workshop on Machine Learning and Music, page 20, 2020.
- HKV09
Toni Heittola, Anssi Klapuri, and Tuomas Virtanen. Musical instrument recognition in polyphonic audio using source-filter model for sound separation. In ISMIR, 327–332. 2009.
- HKVM20
Romain Hennequin, Anis Khlif, Felix Voituret, and Manuel Moussallam. Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software, 5(50):2154, 2020. Deezer Research. doi:10.21105/joss.02154.
- HCLRW16
John R Hershey, Zhuo Chen, Jonathan Le Roux, and Shinji Watanabe. Deep clustering: discriminative embeddings for segmentation and separation. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 31–35. IEEE, 2016.
- HL15
Ying Hu and Guizhong Liu. Separation of singing voice using nonnegative matrix partial co-factorization for singer identification. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(4):643–653, 2015.
- HL20
Yun-Ning Hung and Alexander Lerch. Multitask learning for instrument activation aware music source separation. arXiv preprint arXiv:2008.00616, 2020.
- IS15
Sergey Ioffe and Christian Szegedy. Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
- JFB+11
Rajesh Jaiswal, Derry FitzGerald, Dan Barry, Eugene Coyle, and Scott Rickard. Clustering NMF basis functions using shifted NMF for monaural sound source separation. In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 245–248. IEEE, 2011.
- JBEW19
Andreas Jansson, Rachel M Bittner, Sebastian Ewert, and Tillman Weyde. Joint singing voice separation and F0 estimation with deep U-Net architectures. In 2019 27th European Signal Processing Conference (EUSIPCO), 1–5. IEEE, 2019.
- JHM+17
Andreas Jansson, Eric Humphrey, Nicola Montecchio, Rachel Bittner, Aparna Kumar, and Tillman Weyde. Singing voice separation with deep U-Net convolutional networks. In International Society for Music Information Retrieval Conference (ISMIR), 2017.
- KMHGomez20
Venkatesh S Kadandale, Juan F Montesinos, Gloria Haro, and Emilia Gómez. Multi-task U-Net for music source separation. arXiv preprint arXiv:2003.10414, 2020.
- LRWW+19
Jonathan Le Roux, Gordon Wichern, Shinji Watanabe, Andy Sarroff, and John R Hershey. Phasebook and friends: leveraging discrete representations for source separation. IEEE Journal of Selected Topics in Signal Processing, 13(2):370–382, 2019.
- LRWEH19
Jonathan Le Roux, Scott Wisdom, Hakan Erdogan, and John R Hershey. SDR: half-baked or well done? In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 626–630. IEEE, 2019.
- LluisPS19
Francesc Lluís, Jordi Pons, and Xavier Serra. End-to-end music source separation: is it possible in the waveform domain? Proc. Interspeech 2019, pages 4619–4623, 2019.
- LSC+18
Vincent Lostanlen, Justin Salamon, Mark Cartwright, Brian McFee, Andrew Farnsworth, Steve Kelling, and Juan Pablo Bello. Per-channel energy normalization: why and how. IEEE Signal Processing Letters, 26(1):39–43, 2018.
- LCH+17
Yi Luo, Zhuo Chen, John R Hershey, Jonathan Le Roux, and Nima Mesgarani. Deep clustering and conventional networks for music separation: stronger together. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 61–65. IEEE, 2017.
- LM18
Yi Luo and Nima Mesgarani. TasNet: time-domain audio separation network for real-time, single-channel speech separation. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 696–700. IEEE, 2018.
- LM19
Yi Luo and Nima Mesgarani. Conv-TasNet: surpassing ideal time–frequency magnitude masking for speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(8):1256–1266, 2019.
- MSP18
Ethan Manilow, Prem Seetharaman, and Bryan Pardo. The Northwestern University source separation library. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR 2018), Paris, France, September 23–27, 2018.
- MSP20
Ethan Manilow, Prem Seetharaman, and Bryan Pardo. Simultaneous separation and transcription of mixtures with multiple polyphonic and percussive instruments. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 771–775. IEEE, 2020.
- MYK+19
Yoshiki Masuyama, Kohei Yatabe, Yuma Koizumi, Yasuhiro Oikawa, and Noboru Harada. Deep Griffin-Lim iteration. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 61–65. IEEE, 2019.
- MV10
Annamaria Mesaros and Tuomas Virtanen. Automatic recognition of lyrics in singing. EURASIP Journal on Audio, Speech, and Music Processing, 2010(1):546047, 2010.
- MBP19
Gabriel Meseguer-Brocal and Geoffroy Peeters. Conditioned-U-Net: introducing a control mechanism in the U-Net for multiple source separations. arXiv preprint arXiv:1907.01277, 2019.
- Mik12
Tomáš Mikolov. Statistical language models based on neural networks. PhD thesis, Brno University of Technology, 2012.
- ODZ+16
Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: a generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
- PBSondergaard13
Nathanaël Perraudin, Peter Balazs, and Peter L Søndergaard. A fast Griffin-Lim algorithm. In 2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 1–4. IEEE, 2013.
- PCC+20
Darius Petermann, Pritish Chandna, Helena Cuesta, Jordi Bonada, and Emilia Gómez. Deep learning based source separation applied to choir ensembles. arXiv preprint arXiv:2008.07645, 2020.
- PP18
Fatemeh Pishdadian and Bryan Pardo. Multi-resolution common fate transform. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(2):342–354, 2018.
- PAB+02
Mark D Plumbley, Samer A Abdallah, Juan Pablo Bello, Mike E Davies, Giuliano Monti, and Mark B Sandler. Automatic music transcription and audio source separation. Cybernetics & Systems, 33(6):603–627, 2002.
- RLStoter+17
Zafar Rafii, Antoine Liutkus, Fabian-Robert Stöter, Stylianos Ioannis Mimilakis, and Rachel Bittner. The MUSDB18 corpus for music separation. December 2017. doi:10.5281/zenodo.1117372.
- RLStoter+18
Zafar Rafii, Antoine Liutkus, Fabian-Robert Stöter, Stylianos Ioannis Mimilakis, Derry FitzGerald, and Bryan Pardo. An overview of lead and accompaniment separation in music. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(8):1307–1335, 2018.
- RLS+19
Zafar Rafii, Antoine Liutkus, Fabian-Robert Stöter, Stylianos Ioannis Mimilakis, and Rachel Bittner. MUSDB18-HQ: an uncompressed version of MUSDB18. August 2019. doi:10.5281/zenodo.3338373.
- RP11
Zafar Rafii and Bryan Pardo. Degenerate unmixing estimation technique using the constant Q transform. In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 217–220. IEEE, 2011.
- RP12a
Zafar Rafii and Bryan Pardo. Music/voice separation using the similarity matrix. In ISMIR. 2012.
- RP12b
Zafar Rafii and Bryan Pardo. Repeating pattern extraction technique (REPET): a simple method for music/voice separation. IEEE Transactions on Audio, Speech, and Language Processing, 21(1):73–84, 2012.
- SBStoter+18
Michael Schoeffler, Sarah Bartoschek, Fabian-Robert Stöter, Marlene Roess, Susanne Westphal, Bernd Edler, and Jürgen Herre. webMUSHRA: a comprehensive framework for web-based listening tests. Journal of Open Research Software, 2018.
- SPP17
Prem Seetharaman, Fatemeh Pishdadian, and Bryan Pardo. Music/voice separation using the 2D Fourier transform. In 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 36–40. IEEE, 2017.
- SWPR20
Prem Seetharaman, Gordon Wichern, Bryan Pardo, and Jonathan Le Roux. AutoClip: adaptive gradient clipping for source separation networks. arXiv preprint arXiv:2007.14469, 2020.
- SDL19
Bidisha Sharma, Rohan Kumar Das, and Haizhou Li. On the importance of audio-source separation for singer identification in polyphonic music. In INTERSPEECH, 2020–2024. 2019.
- SLL+19
Ziqiang Shi, Huibin Lin, Liu Liu, Rujie Liu, and Jiqing Han. Is CQT more suitable for monaural speech separation than STFT? An empirical study. arXiv preprint arXiv:1902.00631, 2019.
- SHGomez20
Olga Slizovskaia, Gloria Haro, and Emilia Gómez. Conditioned source separation for music instrument performances. arXiv preprint arXiv:2004.03873, 2020.
- SI11
Julius O Smith III. Spectral audio signal processing. W3K Publishing, 2011.
- SED18a
Daniel Stoller, Sebastian Ewert, and Simon Dixon. Jointly detecting and separating singing voice: a multi-task approach. In International Conference on Latent Variable Analysis and Signal Separation, 329–339. Springer, Cham, 2018.
- SED18b
Daniel Stoller, Sebastian Ewert, and Simon Dixon. Wave-U-Net: a multi-scale neural network for end-to-end audio source separation. arXiv preprint arXiv:1806.03185, 2018.
- StoterLB+16
Fabian-Robert Stöter, Antoine Liutkus, Roland Badeau, Bernd Edler, and Paul Magron. Common fate model for unison source separation. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 126–130. IEEE, 2016.
- TM20
Naoya Takahashi and Yuki Mitsufuji. D3Net: densely connected multidilated DenseNet for music source separation. arXiv preprint arXiv:2010.01733, 2020.
- UVL16
Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instance normalization: the missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.
- VGFevotte06
Emmanuel Vincent, Rémi Gribonval, and Cédric Févotte. Performance measurement in blind audio source separation. IEEE Transactions on Audio, Speech, and Language Processing, 14(4):1462–1469, 2006.
- VVG18
Emmanuel Vincent, Tuomas Virtanen, and Sharon Gannot. Audio source separation and speech enhancement. John Wiley & Sons, 2018.
- WLRH18
Zhong-Qiu Wang, Jonathan Le Roux, and John R Hershey. Alternative objective functions for deep clustering. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 686–690. IEEE, 2018.
- WHLRS14
Felix Weninger, John R Hershey, Jonathan Le Roux, and Björn Schuller. Discriminatively trained recurrent neural networks for single-channel speech separation. In 2014 IEEE Global Conference on Signal and Information Processing (GlobalSIP), 577–581. IEEE, 2014.
- WWollmerS11
Felix Weninger, Martin Wöllmer, and Björn Schuller. Automatic assessment of singer traits in popular music: gender, age, height and race. In Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR 2011), Miami, FL, USA. 2011.
- WLR18
Gordon Wichern and Jonathan Le Roux. Phase reconstruction with learned time-frequency representations for single-channel speech separation. In 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), 396–400. IEEE, 2018.
- WH18
Yuxin Wu and Kaiming He. Group normalization. In Proceedings of the European conference on computer vision (ECCV), 3–19. 2018.
- ZHSJ19
Jingzhao Zhang, Tianxing He, Suvrit Sra, and Ali Jadbabaie. Why gradient clipping accelerates training: a theoretical justification for adaptivity. arXiv preprint arXiv:1905.11881, 2019.
- DumoulinVisin16
Vincent Dumoulin and Francesco Visin. A guide to convolution arithmetic for deep learning. arXiv preprint arXiv:1603.07285, 2016.