Real-Time Singing Voice Conversion Plug-In
In this paper, we propose an approach to real-time singing voice conversion and outline its development as a plug-in suitable for streaming use in a digital audio workstation. In order to simultaneously ensure pitch preservation and reduce the computational complexity of the overall system, we adopt a source-filter methodology and consider a vocoder-free paradigm for modeling the conversion task. In this case, the source is extracted and altered using more traditional DSP techniques, while the filter is determined using a deep neural network. The latter can be trained in an end-to-end fashion and additionally uses adversarial training to improve system fidelity. Careful design allows the system to scale naturally to sampling rates higher than the neural filter model sampling rate, outputting full-band signals while avoiding the need for resampling. Accordingly, the resulting system, when operating at 44.1 kHz, incurs under 60 ms of latency and operates 20 times faster than real time on a standard laptop CPU.
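The abstract does not spell out which DSP technique extracts the source; the sketch below illustrates one conventional realization of such a source-filter split, using LPC inverse filtering as a stand-in for the source stage (the neural filter itself is omitted, and frame-based processing is skipped for brevity). The librosa example audio and the LPC order are illustrative choices, not values from the paper.

```python
import librosa
import scipy.signal

def split_source_filter(y, order=24):
    """Split audio into an excitation (source) and an all-pole envelope (filter)."""
    a = librosa.lpc(y, order=order)               # all-pole coefficients, a[0] == 1
    residual = scipy.signal.lfilter(a, [1.0], y)  # inverse filtering yields the source
    return residual, a

y, sr = librosa.load(librosa.ex("trumpet"), sr=44100, duration=1.0)
source, a = split_source_filter(y)
# In the full system, the source would be pitch-shifted with conventional DSP and
# the all-pole stage replaced by the learned neural filter; here we only resynthesize.
resynth = scipy.signal.lfilter([1.0], a, source)
```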
Differentiable All-Pass Filters for Phase Response Estimation and Automatic Signal Alignment
Virtual analog (VA) audio effects are increasingly based on neural networks and deep learning frameworks. Due to the underlying black-box methodology, a successful model will learn to approximate the data it is presented with, including potential errors such as latency and audio dropouts as well as non-linear characteristics and frequency-dependent phase shifts produced by the hardware. The latter is of particular interest, as the learned phase response might cause unwanted audible artifacts when the effect is used for creative processing techniques such as dry-wet mixing or parallel compression. To overcome these artifacts, we propose differentiable signal processing tools and deep optimization structures for automatically tuning all-pass filters to predict the phase response of different VA simulations and align processed signals that are out of phase. The approaches are assessed using objective metrics, while listening tests evaluate their ability to enhance the quality of parallel path processing techniques. Ultimately, an overparameterized, BiasNet-based all-pass model is proposed for the optimization problem under consideration, resulting in models that can estimate all-pass filter coefficients to align a dry signal with its affected, wet equivalent.
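As a rough illustration of the idea, the following sketch frequency-samples a cascade of first-order all-pass sections and tunes their coefficients by gradient descent to align a dry signal with a wet target. The direct coefficient parameterization and the L1 time-domain loss are simplifying assumptions; the paper's BiasNet-based model is not reproduced here.

```python
import torch

class AllpassCascade(torch.nn.Module):
    """Cascade of first-order all-pass sections applied via frequency sampling."""

    def __init__(self, n_sections=4, n_fft=4096):
        super().__init__()
        self.a = torch.nn.Parameter(torch.zeros(n_sections))  # one coefficient per section
        w = torch.linspace(0, torch.pi, n_fft // 2 + 1)
        self.register_buffer("z", torch.exp(-1j * w))         # z^-1 on the unit circle

    def forward(self, x):
        a = torch.tanh(self.a)                     # keep |a| < 1 for stability
        H = torch.ones_like(self.z)
        for ai in a:                               # H(z) = prod (a + z^-1) / (1 + a z^-1)
            H = H * (ai + self.z) / (1 + ai * self.z)
        X = torch.fft.rfft(x, n=2 * (self.z.numel() - 1))
        return torch.fft.irfft(X * H)[..., : x.shape[-1]]

# Tune the sections so the filtered dry signal lines up with the wet target.
model = AllpassCascade()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
dry, wet = torch.randn(4096), torch.randn(4096)    # placeholders for real audio
for _ in range(100):
    loss = torch.mean(torch.abs(model(dry) - wet))
    opt.zero_grad(); loss.backward(); opt.step()
```

Since an all-pass cascade has unit magnitude response by construction, only the phase is adjusted, which is exactly what alignment for dry-wet mixing requires.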
Pywdf: An Open Source Library for Prototyping and Simulating Wave Digital Filter Circuits in Python
This paper introduces a new open-source Python library for the modeling and simulation of wave digital filter (WDF) circuits. The library, called pywdf, allows users to easily create and analyze WDF circuit models in a high-level, object-oriented manner. The library includes a variety of built-in components, such as voltage sources, capacitors, and diodes, as well as the ability to create custom components and circuits. Additionally, pywdf includes a variety of analysis tools, such as frequency response and transient analysis, to aid in the design and optimization of WDF circuits. We demonstrate the library’s efficacy in replicating the nonlinear behavior of an analog diode clipper circuit, and in creating an all-pass filter that cannot be realized in the analog world. The library is well documented and includes several examples to help users get started. Overall, pywdf is a powerful tool for anyone working with WDF circuits, and we hope it can be of great use to researchers and engineers in the field.
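To make the WDF idea concrete without guessing at pywdf's exact class names, here is a from-scratch sketch of the wave-variable updates such a library automates, for the textbook RC low-pass (a resistive voltage source at the root driving a capacitor). Component values are illustrative.

```python
class RCLowPassWDF:
    """Textbook RC low-pass realized with wave variables (port voltage v = (a + b) / 2)."""

    def __init__(self, fs, R=1e3, C=33e-9):
        self.R = R                          # source's series resistance (ohms)
        self.Rc = 1.0 / (2.0 * fs * C)      # capacitor port resistance (trapezoidal rule)
        self.b_cap = 0.0                    # capacitor reflects its previous incident wave

    def process(self, vin):
        # Reflected wave of the unadapted resistive voltage source at the root:
        b_src = (2.0 * vin * self.Rc + self.b_cap * (self.R - self.Rc)) / (self.R + self.Rc)
        v_out = 0.5 * (b_src + self.b_cap)  # capacitor voltage = (incident + reflected) / 2
        self.b_cap = b_src                  # incident wave becomes the next reflection
        return v_out

lp = RCLowPassWDF(fs=44100.0)               # cutoff ~ 1 / (2*pi*R*C), about 4.8 kHz here
y = [lp.process(x) for x in [1.0] * 64]     # step response, sample by sample
```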
Design of FPGA-Based High-Order FDTD Method for Room Acoustics
Sound field rendering with the finite difference time domain (FDTD) method is computation-intensive and memory-intensive. This research investigates an FPGA-based acceleration system for sound field rendering with the high-order FDTD method, in which spatial and temporal blocking are applied to alleviate the external memory bandwidth bottleneck and to reuse data, respectively. After implementation on the DE10-Pro FPGA card, the FPGA-based sound field rendering systems outperform software simulations conducted on a desktop machine with 512 GB of DRAM and a Xeon Gold 6212U processor (24 cores) running at 2.4 GHz by factors of 11, 13, and 18 in computing performance for the 2nd-order, 4th-order, and 6th-order FDTD schemes, respectively, even though the FPGA-based systems run at a much lower clock frequency and have much smaller on-chip and external memory.
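For reference, the sketch below shows the core update that such renderers accelerate: a 2nd-order-in-space, 2-D leapfrog FDTD scheme in NumPy. The paper's high-order stencils, 3-D grids, and FPGA blocking strategies are not reproduced, and np.roll implies periodic boundaries rather than real room walls.

```python
import numpy as np

c, dx = 343.0, 0.05                        # speed of sound (m/s), grid spacing (m)
dt = dx / (c * np.sqrt(2.0))               # CFL-stable time step for a 2-D grid
lam2 = (c * dt / dx) ** 2                  # squared Courant number (0.5 here)

p_prev = np.zeros((200, 200))              # pressure at t - 1
p = np.zeros_like(p_prev)                  # pressure at t
p[100, 100] = 1.0                          # impulsive point source

for _ in range(500):
    lap = (np.roll(p, 1, 0) + np.roll(p, -1, 0) +
           np.roll(p, 1, 1) + np.roll(p, -1, 1) - 4.0 * p)   # 2nd-order Laplacian
    p_next = 2.0 * p - p_prev + lam2 * lap                   # leapfrog update in time
    p_prev, p = p, p_next
```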
A Virtual Instrument for IFFT-Based Additive Synthesis in the Ambisonics Domain
Spatial additive synthesis can be efficiently implemented by applying the inverse Fourier transform to create the individual channels of Ambisonics signals. In the presented work, this approach has been implemented as an audio plugin, allowing the generation and control of basic waveforms and their spatial attributes in a typical DAW-based music production context. Triggered envelopes and low-frequency oscillators can be mapped to the spectral shape, source position, and source width of the resulting sounds. A technical evaluation shows the computational advantages of the proposed method for additive sounds with high numbers of partials and different Ambisonics orders. The results of a user study indicate the potential of the developed plugin for manipulating the perceived position, source width, and timbre coloration.
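A stripped-down sketch of the underlying IFFT synthesis loop is given below for a single mono channel, with partials snapped to FFT bin centers for simplicity; the published method also handles off-bin frequencies via the analysis window's spectral main lobe and renders one such spectrum per Ambisonics channel. All constants, and the rough amplitude scaling, are illustrative.

```python
import numpy as np

fs, n_fft, hop = 48000, 2048, 512
freqs = np.array([220.0, 440.0, 660.0, 880.0])     # partial frequencies (Hz)
amps = np.array([1.0, 0.5, 0.33, 0.25])            # partial amplitudes

bins = np.round(freqs * n_fft / fs).astype(int)    # snap each partial to a bin
bin_freqs = bins * fs / n_fft                      # actually synthesized frequencies
phases = np.zeros(len(bins))

window = np.hanning(n_fft)
out = np.zeros(100 * hop + n_fft)
for frame in range(100):
    spectrum = np.zeros(n_fft // 2 + 1, dtype=complex)
    spectrum[bins] = amps * np.exp(1j * phases)    # one complex bin per partial
    grain = np.fft.irfft(spectrum) * window * (n_fft / 2)   # back to time domain
    out[frame * hop : frame * hop + n_fft] += grain         # overlap-add
    phases += 2 * np.pi * bin_freqs * hop / fs     # keep partials phase-coherent
```

The efficiency gain comes from one inverse FFT per frame replacing per-sample oscillators, which is why the approach scales well to high partial counts and Ambisonics orders.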
Vocal Timbre Effects with Differentiable Digital Signal Processing
We explore two approaches to creatively altering vocal timbre using Differentiable Digital Signal Processing (DDSP). The first approach is inspired by classic cross-synthesis techniques. A pretrained DDSP decoder predicts a filter for a noise source and a harmonic distribution, based on pitch and loudness information extracted from the vocal input. Before synthesis, the harmonic distribution is modified by interpolating between the predicted distribution and the harmonics of the input. We provide a real-time implementation of this approach in the form of a Neutone model. In the second approach, autoencoder models are trained on datasets consisting of both vocal and instrument training data. To apply the effect, the trained autoencoder attempts to reconstruct the vocal input. We find that there is a desirable “sweet spot” during training, where the model has learned to reconstruct the phonetic content of the input vocals, but is still affected by the timbre of the instrument mixed into the training data. After further training, that effect disappears. A perceptual evaluation compares the two approaches. We find that the autoencoder in the second approach is able to reconstruct intelligible lyrical content without any explicit phonetic information provided during training.
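The harmonic-interpolation step of the first approach can be sketched in a few lines; harm_pred and harm_in below are assumed stand-ins for the decoder's predicted harmonic distribution and the harmonics measured from the input voice, not the Neutone model's actual tensors.

```python
import numpy as np

def blend_harmonics(harm_pred, harm_in, alpha):
    """Interpolate two harmonic distributions and renormalize to sum to 1."""
    mixed = (1.0 - alpha) * harm_pred + alpha * harm_in
    return mixed / np.sum(mixed, axis=-1, keepdims=True)

harm_pred = np.random.dirichlet(np.ones(60))   # decoder output: 60 harmonic weights
harm_in = np.random.dirichlet(np.ones(60))     # harmonics extracted from the voice
blended = blend_harmonics(harm_pred, harm_in, alpha=0.5)
```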
Automatic Recognition of Cascaded Guitar Effects
This paper reports on a new multi-label classification task for guitar effect recognition that is closer to the actual use case of guitar effect pedals. To generate the dataset, we used multiple clean guitar audio datasets and applied various combinations of 13 commonly used guitar effects. We compared four neural network structures: a simple multi-layer perceptron as a baseline, ResNet models, a CRNN model, and a sample-level CNN model. The ResNet models achieved the best performance in terms of accuracy and robustness under various setups (with or without clean audio, seen or unseen dataset), with a micro F1 of 0.876 and a macro F1 of 0.906 in the hardest setup. An ablation study on the ResNet models further indicates the necessary model complexity for the task.
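For clarity on the reported metrics, the snippet below shows how micro and macro F1 are computed for a multi-label task over 13 effect classes, using randomly generated indicator matrices in place of real model predictions:

```python
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(100, 13))       # each clip: a subset of 13 effects
y_pred = np.where(rng.random((100, 13)) < 0.9,    # flip ~10% of labels as "errors"
                  y_true, 1 - y_true)

print("micro F1:", f1_score(y_true, y_pred, average="micro"))  # pools all decisions
print("macro F1:", f1_score(y_true, y_pred, average="macro"))  # averages per-class F1
```

Micro F1 weights every clip-effect decision equally, while macro F1 weights every effect class equally, which is why both are reported for an imbalanced multi-label dataset.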
Informed Source Separation for Stereo Unmixing — An Open Source Implementation
Active listening consists of interacting with the music as it plays and has numerous potential applications, from pedagogy and gaming to creation. In the context of the music industry, using existing musical recordings (e.g. studio stems), it could be possible for the listener to generate new versions of a given musical piece (i.e. an artistic mix). But imagine one could do this from the original mix itself. In a previous research project, we proposed a coder/decoder scheme for what we called informed source separation: the coder determines the information necessary to recover the tracks and embeds it inaudibly (using watermarking) in the mix; the decoder enhances the source separation with this information. We proposed and patented several methods, using various types of embedded information and separation techniques, hoping that the music industry was ready to give the listener this freedom of active listening. Fortunately, there are numerous other possible applications, such as the manipulation of musical archives, for example in the context of ethnomusicology. But the patents remain in force for many years, which is problematic. In this article, we present an open-source implementation of a patent-free algorithm to address the audio mixing and unmixing problem for any type of music.
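As a conceptual illustration only (not the article's patent-free algorithm, and omitting the watermarking layer entirely), the sketch below shows the coder/decoder asymmetry: the coder derives compact side information from the stems, and the decoder uses it to mask the mix.

```python
import numpy as np

def coder(source_stfts):
    """From the original stems' STFTs, record which source dominates each bin."""
    mags = np.abs(source_stfts)             # shape: (n_sources, freq, time)
    return np.argmax(mags, axis=0)          # side info, to be embedded in the mix

def decoder(mix_stft, side_info, n_sources):
    """Recover source estimates from the mix using the embedded indices."""
    estimates = np.zeros((n_sources,) + mix_stft.shape, dtype=complex)
    for k in range(n_sources):
        estimates[k] = mix_stft * (side_info == k)   # binary time-frequency masking
    return estimates
```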
Differentiable Feedback Delay Network for Colorless Reverberation
Artificial reverberation algorithms often suffer from spectral coloration, usually in the form of metallic ringing, which impairs the perceived quality of sound. This paper proposes a method to reduce the coloration in the feedback delay network (FDN), a popular artificial reverberation algorithm. An optimization framework is employed, entailing a differentiable FDN, to learn a set of parameters that decrease coloration. The optimization objective is to minimize a spectral loss to obtain a flat magnitude response, with an additional temporal loss term to control the sparseness of the impulse response. The objective evaluation of the method shows a favorable, narrower distribution of modal excitation while retaining the impulse response density. The subjective evaluation demonstrates that the proposed method lowers the perceptual coloration of late reverberation, and also shows that the suggested optimization improves sound quality for small FDN sizes. The method proposed in this work constitutes an improvement in the design of accurate and high-quality artificial reverberation, while simultaneously offering computational savings.
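A minimal differentiable-FDN sketch follows, with four delay lines, an orthogonal (lossless) feedback matrix parameterized via a skew-symmetric matrix, and a flatness-promoting spectral loss. The paper's exact parameterization and its additional temporal sparseness term are not reproduced, and the per-sample Python loop is purely illustrative.

```python
import torch

delays = [149, 211, 263, 293]                       # mutually prime delay lengths
W = torch.nn.Parameter(0.1 * torch.randn(4, 4))
b = torch.nn.Parameter(torch.ones(4))               # input gains
c = torch.nn.Parameter(torch.ones(4))               # output gains
opt = torch.optim.Adam([W, b, c], lr=1e-2)

def fdn_impulse_response(n=1024):
    A = torch.linalg.matrix_exp(W - W.T)            # skew-symmetric -> orthogonal matrix
    lines = [[torch.zeros(())] * d for d in delays] # functional delay lines (no in-place ops)
    out = []
    for t in range(n):
        s = torch.stack([line[0] for line in lines])    # delay-line outputs
        x = 1.0 if t == 0 else 0.0                      # unit impulse input
        fb = A @ s + b * x
        lines = [line[1:] + [fb[i]] for i, line in enumerate(lines)]
        out.append(torch.dot(c, s))
    return torch.stack(out)

for step in range(20):
    h = fdn_impulse_response()
    mag = torch.abs(torch.fft.rfft(h))
    loss = torch.mean((mag - mag.mean()) ** 2)      # push the magnitude response toward flat
    opt.zero_grad(); loss.backward(); opt.step()
```

Keeping the feedback matrix orthogonal preserves losslessness, so the optimization reshapes coloration without altering the reverberator's energy behavior.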
Vocal Tract Area Estimation by Gradient Descent
Articulatory features can provide interpretable and flexible controls for the synthesis of human vocalizations by allowing the user to directly modify parameters like vocal strain or lip position. To make this manipulation through resynthesis possible, we need to estimate the features that result in a desired vocalization directly from audio recordings. In this work, we propose a white-box optimization technique for estimating glottal source parameters and vocal tract shapes from audio recordings of human vowels. The approach is based on inverse filtering and optimizing the frequency response of a waveguide model of the vocal tract with gradient descent, propagating error gradients through the mapping of articulatory features to the vocal tract area function. We apply this method to the task of matching the sound of the Pink Trombone, an interactive articulatory synthesizer, to a given vocalization. We find that our method accurately recovers control functions for audio generated by the Pink Trombone itself. We then compare our technique against evolutionary optimization algorithms and a neural network trained to predict control parameters from audio. A subjective evaluation finds that our approach outperforms these black-box optimization baselines on the task of reproducing human vocalizations.
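The white-box recipe can be caricatured as follows: map learnable tube areas to reflection coefficients, build the resulting all-pole tract response, and fit it to a target magnitude spectrum by gradient descent. The Kelly-Lochbaum-style tube, the plain log-spectral loss, and all constants below are simplifying assumptions relative to the paper's waveguide model and glottal source estimation.

```python
import torch

n_sections, n_bins = 20, 512
log_areas = torch.nn.Parameter(torch.zeros(n_sections))   # learnable tube areas (log scale)
opt = torch.optim.Adam([log_areas], lr=5e-2)
w = torch.linspace(0.0, torch.pi, n_bins)                 # digital frequencies (rad/sample)
target = torch.rand(n_bins) + 0.1                         # placeholder: |H| of the recorded vowel

def tract_response():
    areas = torch.exp(log_areas)
    k = (areas[:-1] - areas[1:]) / (areas[:-1] + areas[1:])  # reflection coeffs, |k| < 1
    poly = torch.ones(1)
    for km in k:                                       # step-up recursion builds all-pole A(z)
        padded = torch.cat([poly, torch.zeros(1)])
        poly = padded + km * torch.flip(padded, [0])
    n = torch.arange(poly.numel(), dtype=torch.float32)
    re = torch.cos(w[:, None] * n) @ poly              # real part of A(e^{jw})
    im = -torch.sin(w[:, None] * n) @ poly             # imaginary part of A(e^{jw})
    return 1.0 / torch.sqrt(re**2 + im**2 + 1e-12)     # |H| = 1 / |A|

for _ in range(200):
    resp = tract_response()
    loss = torch.mean((torch.log(resp) - torch.log(target)) ** 2)  # log-spectral distance
    opt.zero_grad(); loss.backward(); opt.step()
```

Because the reflection coefficients derived from positive areas always stay in (-1, 1), the fitted all-pole model remains stable throughout optimization, which is one practical appeal of parameterizing the problem through the area function.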