Automatic Equalization for Individual Instrument Tracks Using Convolutional Neural Networks
We propose a novel approach for the automatic equalization of individual musical instrument tracks. Our method begins by identifying the instrument present within a source recording in order to choose its corresponding ideal spectrum as a target. Next, the spectral difference between the recording and the target is calculated, and accordingly, an equalizer matching model is used to predict settings for a parametric equalizer. To this end, we build upon a differentiable parametric equalizer matching neural network, demonstrating improvements relative to the previously established state of the art. Unlike past approaches, we show how our system naturally allows real-world audio data to be leveraged during the training of our matching model, effectively generating suitably produced training targets in an automated manner that mirrors conditions at inference time. Consequently, we illustrate how fine-tuning our matching model on such examples considerably improves parametric equalizer matching performance in real-world scenarios, decreasing mean absolute error by 24% relative to methods relying solely on random parameter sampling techniques as a self-supervised learning strategy. We perform listening tests and demonstrate that our proposed automatic equalization solution subjectively enhances the tonal characteristics of recordings of common instrument types.
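A minimal sketch of the pipeline described above, assuming hypothetical placeholders for the instrument classifier, the library of ideal target spectra, and the equalizer matching model (none of these names come from the paper):

    # Sketch of the described pipeline (not the authors' code).
    # `classify_instrument`, `ideal_spectra`, and `eq_matching_model`
    # are hypothetical placeholders passed in by the caller.
    import numpy as np

    def log_mag_spectrum(audio, n_fft=2048, hop=512):
        """Average log-magnitude spectrum of a mono signal."""
        frames = np.lib.stride_tricks.sliding_window_view(audio, n_fft)[::hop]
        window = np.hanning(n_fft)
        mags = np.abs(np.fft.rfft(frames * window, axis=-1))
        return np.log(mags.mean(axis=0) + 1e-7)

    def auto_equalize(audio, ideal_spectra, classify_instrument, eq_matching_model):
        instrument = classify_instrument(audio)        # e.g. "snare", "bass"
        target = ideal_spectra[instrument]             # ideal log spectrum for that instrument
        diff = target - log_mag_spectrum(audio)        # spectral difference to be corrected
        return eq_matching_model(diff)                 # predicted parametric EQ settings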
Hyper Recurrent Neural Network: Condition Mechanisms for Black-Box Audio Effect Modeling
Recurrent neural networks (RNNs) have demonstrated impressive results for virtual analog modeling of audio effects. These networks process time-domain audio signals using a series of matrix multiplications and nonlinear activation functions to emulate the behavior of the target device accurately. To additionally model the effect of control knobs in an RNN-based model, existing approaches integrate control parameters by concatenating them channel-wise with an intermediate representation of the input signal. While this method is parameter-efficient, there is room to further improve the quality of the generated audio, because the concatenation-based conditioning method has limited capacity to modulate the signal. In this paper, we propose three novel conditioning mechanisms for RNNs, tailored for black-box virtual analog modeling. These advanced conditioning mechanisms modulate the model based on control parameters, yielding superior results to existing RNN- and CNN-based architectures across various evaluation metrics.
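For illustration, a sketch contrasting the concatenation-based conditioning criticised above with one possible multiplicative alternative (a FiLM-style scale and shift); the paper's three proposed mechanisms are not specified in the abstract, so the second module is only an assumed example:

    import torch
    import torch.nn as nn

    class ConcatConditionedRNN(nn.Module):
        """Baseline: knob values are concatenated channel-wise with the input."""
        def __init__(self, n_cond, hidden=32):
            super().__init__()
            self.rnn = nn.LSTM(1 + n_cond, hidden, batch_first=True)
            self.out = nn.Linear(hidden, 1)

        def forward(self, x, cond):               # x: (B, T, 1), cond: (B, n_cond)
            c = cond.unsqueeze(1).expand(-1, x.shape[1], -1)
            h, _ = self.rnn(torch.cat([x, c], dim=-1))
            return self.out(h)

    class FiLMConditionedRNN(nn.Module):
        """Illustrative alternative: knobs modulate the hidden state (scale/shift)."""
        def __init__(self, n_cond, hidden=32):
            super().__init__()
            self.rnn = nn.LSTM(1, hidden, batch_first=True)
            self.film = nn.Linear(n_cond, 2 * hidden)
            self.out = nn.Linear(hidden, 1)

        def forward(self, x, cond):
            h, _ = self.rnn(x)
            gamma, beta = self.film(cond).unsqueeze(1).chunk(2, dim=-1)
            return self.out(gamma * h + beta)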
Improving Unsupervised Clean-to-Rendered Guitar Tone Transformation Using GANs and Integrated Unaligned Clean Data
Recent years have seen increasing interest in applying deep learning methods to the modeling of guitar amplifiers and effect pedals. Existing methods are mainly based on a supervised approach, requiring temporally aligned data pairs of unprocessed and rendered audio. However, this approach does not scale well, due to the complicated process involved in creating the data pairs. Recent work by Wright et al. has explored the potential of leveraging unpaired data for training, using a generative adversarial network (GAN)-based framework. This paper extends their work by using more advanced discriminators in the GAN and by using more unpaired data for training. Specifically, drawing inspiration from recent advancements in neural vocoders, we employ two sets of discriminators in our GAN-based model for guitar amplifier modeling, one based on the multi-scale discriminator (MSD) and the other on the multi-period discriminator (MPD). Moreover, we experiment with adding unprocessed audio signals that do not have corresponding rendered audio of a target tone to the training data, to see how much the GAN model benefits from the unpaired data. Our experiments show that the two proposed extensions contribute to the modeling of both low-gain and high-gain guitar amplifiers.
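A sketch of the two vocoder-style discriminator families referred to above (multi-period and multi-scale), with illustrative layer sizes that are not taken from the paper:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PeriodDiscriminator(nn.Module):
        """Views the waveform as a 2D grid of shape (T/period, period)."""
        def __init__(self, period):
            super().__init__()
            self.period = period
            self.convs = nn.Sequential(
                nn.Conv2d(1, 16, (5, 1), stride=(3, 1), padding=(2, 0)), nn.LeakyReLU(0.1),
                nn.Conv2d(16, 32, (5, 1), stride=(3, 1), padding=(2, 0)), nn.LeakyReLU(0.1),
                nn.Conv2d(32, 1, (3, 1), padding=(1, 0)),
            )

        def forward(self, x):                    # x: (B, 1, T)
            b, c, t = x.shape
            pad = (-t) % self.period
            x = F.pad(x, (0, pad)).view(b, c, -1, self.period)
            return self.convs(x)

    class ScaleDiscriminator(nn.Module):
        """Operates on the raw or average-pooled waveform with 1D convolutions."""
        def __init__(self):
            super().__init__()
            self.convs = nn.Sequential(
                nn.Conv1d(1, 16, 15, stride=4, padding=7), nn.LeakyReLU(0.1),
                nn.Conv1d(16, 32, 15, stride=4, padding=7), nn.LeakyReLU(0.1),
                nn.Conv1d(32, 1, 3, padding=1),
            )

        def forward(self, x):
            return self.convs(x)

    # MPD = several PeriodDiscriminators (e.g. periods 2, 3, 5, 7, 11);
    # MSD = several ScaleDiscriminators applied to progressively pooled audio.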
DDSP-Based Neural Waveform Synthesis of Polyphonic Guitar Performance From String-Wise MIDI Input
We explore the use of neural synthesis for acoustic guitar from string-wise MIDI input. We propose four different systems and compare them, using both objective metrics and subjective evaluation, against natural audio and a sample-based baseline. We develop these four systems iteratively, making various considerations on the architecture and intermediate tasks, such as predicting pitch and loudness control features. We find that formulating the control feature prediction task as a classification task rather than a regression task yields better results. Furthermore, we find that our simplest proposed system, which directly predicts synthesis parameters from MIDI input, performs the best of the four proposed systems. Audio examples and code are available.
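As an illustration of the classification-versus-regression choice for control features, a sketch that quantises frame-wise loudness into bins and predicts them with a classifier; the bin count, loudness range, and head architecture are assumptions, not values from the paper:

    import torch
    import torch.nn as nn

    N_BINS = 64
    LOUDNESS_MIN, LOUDNESS_MAX = -80.0, 0.0        # assumed dB range

    def loudness_to_bin(loudness_db):
        norm = (loudness_db - LOUDNESS_MIN) / (LOUDNESS_MAX - LOUDNESS_MIN)
        return torch.clamp((norm * (N_BINS - 1)).long(), 0, N_BINS - 1)

    def bin_to_loudness(bins):
        return LOUDNESS_MIN + bins.float() / (N_BINS - 1) * (LOUDNESS_MAX - LOUDNESS_MIN)

    class ControlFeatureHead(nn.Module):
        """Per-frame classifier over loudness bins, driven by a MIDI encoding."""
        def __init__(self, midi_dim, hidden=128):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(midi_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, N_BINS))

        def forward(self, midi_frames):           # (B, T, midi_dim) -> (B, T, N_BINS)
            return self.net(midi_frames)

    # Training: loss = nn.CrossEntropyLoss()(logits.flatten(0, 1),
    #                                        loudness_to_bin(target_db).flatten())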
Distortion Recovery: A Two-Stage Method for Guitar Effect Removal
Removing audio effects from electric guitar recordings makes post-production and sound editing easier. An audio distortion recovery model not only improves the clarity of the guitar sounds but also opens up new opportunities for creative adjustments in mixing and mastering. While progress has been made in creating such models, previous efforts have largely focused on synthetic distortions that may be too simplistic to accurately capture the complexities seen in real-world recordings. In this paper, we tackle the task by using a dataset of guitar recordings rendered with commercial-grade audio effect VST plugins. Moreover, we introduce a novel two-stage methodology for audio distortion recovery. The idea is to process the audio signal in the Mel-spectrogram domain in the first stage, and then use a neural vocoder to generate the pristine original guitar sound from the processed Mel-spectrogram in the second stage. We report a set of experiments demonstrating the effectiveness of our approach over existing methods through both subjective and objective evaluation metrics.
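A minimal sketch of the two-stage idea, with placeholder models standing in for the actual Mel-domain network and neural vocoder:

    import torch
    import torch.nn as nn
    import torchaudio

    mel = torchaudio.transforms.MelSpectrogram(sample_rate=44100, n_fft=2048,
                                               hop_length=512, n_mels=128)

    class MelDenoiser(nn.Module):
        """Stage 1: spectrogram-to-spectrogram network (placeholder architecture)."""
        def __init__(self, n_mels=128, hidden=256):
            super().__init__()
            self.net = nn.GRU(n_mels, hidden, batch_first=True, bidirectional=True)
            self.proj = nn.Linear(2 * hidden, n_mels)

        def forward(self, log_mel):                  # (B, T, n_mels)
            h, _ = self.net(log_mel)
            return self.proj(h)

    def recover(distorted_audio, denoiser, vocoder):
        log_mel = torch.log(mel(distorted_audio) + 1e-5).transpose(-1, -2)  # (B, T, n_mels)
        clean_mel = denoiser(log_mel)                # Stage 1: undo the distortion spectrally
        return vocoder(clean_mel)                    # Stage 2: mel -> clean waveform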
RAVE for Speech: Efficient Voice Conversion at High Sampling Rates
Voice conversion has gained increasing popularity within the field of audio manipulation and speech synthesis. Often, the main objective is to transfer the identity of the input speech to that of a target speaker without changing its linguistic content. While current work provides high-fidelity solutions, it rarely focuses on model simplicity, high-sampling-rate environments, or streamability. By incorporating speech representation learning into a generative timbre transfer model originally created for musical purposes, we investigate voice conversion generated directly in the time domain at high sampling rates. More specifically, we guide the latent space of a baseline model towards linguistically relevant representations and condition it on external speaker information. Through objective and subjective assessments, we demonstrate that the proposed solution can attain levels of naturalness, quality, and intelligibility comparable to those of a state-of-the-art solution for seen speakers, while significantly decreasing inference time. However, despite the presence of target speaker characteristics in the converted output, the actual similarity to unseen speakers remains a challenge.
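A sketch of the two ideas mentioned above, guiding the latent space toward linguistic representations and conditioning on external speaker information; the encoder, decoder, linguistic feature extractor, loss weighting, and shapes are all assumptions rather than the paper's design:

    import torch.nn.functional as F

    def training_step(audio, encoder, decoder, linguistic_features, speaker_embedding,
                      guide_weight=1.0):
        z = encoder(audio)                                   # latent sequence, e.g. (B, D, T')
        # (a) latent guidance: keep z close to linguistically relevant features
        #     (assumes the extractor returns features with the same shape as z)
        guide_loss = F.mse_loss(z, linguistic_features(audio).detach())
        # (b) speaker conditioning: the decoder receives the target speaker embedding
        recon = decoder(z, speaker_embedding)
        recon_loss = F.l1_loss(recon, audio)
        return recon_loss + guide_weight * guide_loss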
Towards Efficient Modelling of String Dynamics: A Comparison of State Space and Koopman Based Deep Learning Methods
This paper presents an examination of State Space Models (SSMs) and Koopman-based deep learning methods for modelling the dynamics of both linear and nonlinear stiff strings. Through experiments with datasets generated under different initial conditions and sample rates, we assess the capacity of these models to accurately model the complex behaviours observed in string dynamics. Our findings indicate that our proposed Koopman-based model performs as well as or better than other existing approaches in nonlinear cases for long-sequence modelling. We inform the design of these architectures with the structure of the problems at hand. Although challenges remain in extending model predictions beyond the training horizon (i.e., extrapolation), the focus of our investigation lies in the models' ability to generalise across different initial conditions within the training time interval. This research contributes insights into the physical modelling of dynamical systems (in particular those relevant to musical acoustics) by offering a comparative overview of these and previous methods and by introducing innovative strategies for model improvement. Our results highlight the efficacy of these models in simulating nonlinear dynamics and emphasise their wide-ranging applicability in accurately modelling dynamical systems over extended sequences.
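For illustration, a sketch of a Koopman-style rollout in which an encoder lifts the string state into a latent space, a learned linear operator advances it in time, and a decoder maps back; dimensions and layer choices are assumptions rather than the paper's architecture:

    import torch
    import torch.nn as nn

    class KoopmanModel(nn.Module):
        def __init__(self, state_dim, latent_dim=64):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(state_dim, 128), nn.Tanh(),
                                         nn.Linear(128, latent_dim))
            self.K = nn.Linear(latent_dim, latent_dim, bias=False)   # learned Koopman operator
            self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.Tanh(),
                                         nn.Linear(128, state_dim))

        def forward(self, x0, n_steps):
            """Roll out n_steps from initial state x0: (B, state_dim)."""
            z = self.encoder(x0)
            states = []
            for _ in range(n_steps):
                z = self.K(z)                 # linear advance in the lifted space
                states.append(self.decoder(z))
            return torch.stack(states, dim=1) # (B, n_steps, state_dim)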
DDSP-SFX: Acoustically-Guided Sound Effects Generation with Differentiable Digital Signal Processing
Controlling the variations of sound effects using neural audio synthesis models has been a challenging task. Differentiable digital signal processing (DDSP) provides a lightweight solution that achieves high-quality sound synthesis while enabling deterministic control of acoustic attributes by incorporating pre-processed audio features and digital synthesizers. In this research, we introduce DDSP-SFX, a model based on the DDSP architecture capable of synthesizing high-quality sound effects while enabling users to control timbre variations easily. We integrate a transient modelling algorithm into DDSP that achieves higher objective evaluation scores and subjective ratings for impulsive signals (footsteps, gunshots). We propose a novel method that achieves frame-level timbre variation control while also allowing deterministic attribute control. We further qualitatively show the timbre transfer performance using voice as the guiding sound.
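A rough sketch of adding an explicit transient branch alongside DDSP-style harmonic and filtered-noise components; the decaying-template transient, frame sizes, and shapes are illustrative assumptions, not the paper's algorithm:

    import torch

    def synthesize(f0, harm_amps, noise, transient_gain, sr=16000, hop=64):
        """f0: (T,) Hz per frame; harm_amps: (T, K); noise: (T*hop,) pre-filtered noise;
        transient_gain: (T,) one transient gain per frame."""
        T, K = harm_amps.shape
        # harmonic branch: sum of K sinusoids at integer multiples of f0
        f0_up = f0.repeat_interleave(hop)                       # sample-rate f0
        phase = 2 * torch.pi * torch.cumsum(f0_up / sr, dim=0)
        k = torch.arange(1, K + 1)
        amps_up = harm_amps.repeat_interleave(hop, dim=0)       # (T*hop, K)
        harmonic = (amps_up * torch.sin(phase[:, None] * k)).sum(dim=-1)
        # transient branch: a short exponentially decaying click at each frame start
        template = torch.exp(-torch.arange(hop) / (0.2 * hop))
        transient = (transient_gain[:, None] * template).reshape(-1)
        return harmonic + noise + transient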
Searching for Music Mixing Graphs: A Pruning Approach
Music mixing is compositional: experts combine multiple audio processors to achieve a cohesive mix from dry source tracks. We propose a method to reverse engineer this process from the input and output audio. First, we create a mixing console that applies all available processors to every chain. Then, after an initial optimization of the console parameters, we alternate between removing redundant processors and fine-tuning. We achieve this through a differentiable implementation of both the processors and the pruning. Consequently, we find a sparse mixing graph that achieves matching quality nearly identical to that of the full mixing console. We apply this procedure to dry-mix pairs from various datasets and collect graphs that can also be used to train neural networks for music mixing applications.
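A sketch of the alternating optimise-then-prune loop described above; the differentiable console object, its matching loss, and the dry/wet-based pruning criterion are placeholders rather than the paper's implementation:

    import torch

    def search_mixing_graph(console, dry, mix_target, n_rounds=5, steps=1000,
                            prune_threshold=0.05, lr=1e-3):
        """console is assumed to expose: parameters(), __call__(dry), match_loss(),
        active_processors(), remove(proc), and per-processor wet_mix()."""
        params = [p for p in console.parameters() if p.requires_grad]
        opt = torch.optim.Adam(params, lr=lr)
        for _ in range(n_rounds):
            # fine-tune all remaining processors against the target mix
            for _ in range(steps):
                opt.zero_grad()
                loss = console.match_loss(console(dry), mix_target)
                loss.backward()
                opt.step()
            # remove processors that barely contribute to the match
            for proc in list(console.active_processors()):
                if proc.wet_mix().abs().item() < prune_threshold:
                    console.remove(proc)
        return console   # sparse graph of the surviving processors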
Differentiable All-Pole Filters for Time-Varying Audio Systems
Infinite impulse response filters are an essential building block of many time-varying audio systems, such as audio effects and synthesisers. However, their recursive structure impedes end-to-end training of these systems using automatic differentiation. Although non-recursive filter approximations like frequency sampling and frame-based processing have been proposed and widely used in previous works, they cannot accurately reflect the gradient of the original system. We alleviate this difficulty by re-expressing a time-varying all-pole filter to backpropagate the gradients through itself, so the filter implementation is not bound to the technical limitations of automatic differentiation frameworks. This implementation can be employed within audio systems containing filters with poles for efficient gradient evaluation. We demonstrate its training efficiency and expressive capabilities for modelling real-world dynamic audio systems on a phaser, a time-varying subtractive synthesiser, and a feed-forward compressor. We make our code and audio samples available and provide the trained audio effect and synth models in a VST plugin.
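For reference, the time-varying all-pole recursion in question, y[n] = x[n] - sum_k a_k[n] * y[n-k], written as a naive sample-by-sample loop; this version relies on ordinary autograd and only illustrates the operation, it is not the paper's efficient backward pass:

    import torch

    def time_varying_allpole(x, a):
        """x: (T,) input signal; a: (T, K) per-sample coefficients a_1..a_K."""
        T, K = a.shape
        ys = [x.new_zeros(()) for _ in range(K)]   # zero initial conditions
        for n in range(T):
            prev = torch.stack(ys[-K:]).flip(0)    # [y[n-1], ..., y[n-K]]
            ys.append(x[n] - (a[n] * prev).sum())  # recursive feedback path
        return torch.stack(ys[K:])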