Neural Parametric Equalizer Matching Using Differentiable Biquads

This paper proposes a neural network for carrying out parametric equalizer (EQ) matching. The novelty of this neural network solution is that it can be optimized directly in the frequency domain by means of differentiable biquads, rather than relying solely on a loss on parameter values, which does not correlate directly with the system output. We compare the performance of the proposed neural network approach with that of a baseline algorithm
based on a convex relaxation of the problem. It is observed that the
neural network can provide better matching than the baseline approach because it directly attempts to solve the non-convex problem. Moreover, we show that the same network trained with only
a parameter loss is insufficient for the task, despite the fact that it
matches underlying EQ parameters better than one trained with a
combination of spectral and parameter losses.
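As an illustration of what a frequency-domain loss through differentiable biquads can look like, here is a minimal PyTorch sketch; the function names and exact loss form are our assumptions, not the authors' implementation. The magnitude response of each biquad section is evaluated analytically from its coefficients, so a spectral loss backpropagates to the parameters that produced them.

```python
import torch

def biquad_log_magnitude(b, a, n_bins=512):
    """Log-magnitude response of biquad sections, differentiable in b and a.

    b, a: (..., 3) numerator / denominator coefficients (a[..., 0] == 1).
    Returns (..., n_bins) log-magnitudes on a linear frequency grid.
    """
    w = torch.linspace(0.0, torch.pi, n_bins)          # digital frequencies
    z = torch.exp(-1j * w)                             # e^{-j w}
    zs = torch.stack([torch.ones_like(z), z, z * z])   # [1, z^-1, z^-2]
    num = b.to(torch.complex64) @ zs                   # polynomial evaluation
    den = a.to(torch.complex64) @ zs
    return torch.log(num.abs() / den.abs().clamp_min(1e-8) + 1e-8)

def spectral_loss(b_pred, a_pred, b_ref, a_ref):
    # Sum the log responses of the cascaded sections, then compare:
    # the loss acts on the system's magnitude response, not its parameters.
    h_pred = biquad_log_magnitude(b_pred, a_pred).sum(dim=-2)
    h_ref = biquad_log_magnitude(b_ref, a_ref).sum(dim=-2)
    return torch.mean((h_pred - h_ref) ** 2)
```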
Neural Modelling of Time-Varying Effects

This paper proposes a grey-box, neural-network-based approach to modelling LFO-modulated time-varying effects. The neural network model receives both the unprocessed audio and the LFO signal as input. This allows complete control over the
model’s LFO frequency and shape. The neural networks are trained
using guitar audio, which must be processed by the target effect and annotated with the predicted LFO signal before training.
A measurement signal based on regularly spaced chirps was used
to accurately predict the LFO signal. The model architecture has
been previously shown to be capable of running in real-time on a
modern desktop computer, whilst using relatively little processing
power. We validate our approach by creating models of both a phaser and a flanger effects pedal; in principle, it can be applied to any LFO-modulated time-varying effect. In the best case, an error-to-signal ratio of 1.3% is achieved when modelling a flanger pedal,
and previous work has shown that this corresponds to the model
being nearly indistinguishable from the target device.
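For context, the error-to-signal ratio (ESR) quoted above is conventionally defined as the energy of the residual divided by the energy of the target. A minimal NumPy rendition of that standard definition (our illustration; reported ESR figures are often computed on pre-emphasis-filtered signals):

```python
import numpy as np

def error_to_signal_ratio(target, prediction):
    """ESR: residual energy divided by target energy."""
    return np.sum((target - prediction) ** 2) / np.sum(target ** 2)

# An ESR of 0.013 corresponds to the 1.3% figure reported for the flanger model.
```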
Differentiable IIR Filters for Machine Learning Applications

In this paper, we present an approach to using traditional digital IIR filter structures inside deep-learning networks trained using backpropagation. We establish the link between such structures and
recurrent neural networks. Three different differentiable IIR filter
topologies are presented and compared against each other and an
established baseline. Additionally, a simple Wiener-Hammerstein
model using differentiable IIRs as its filtering component is presented and trained on a guitar signal played through a Boss DS-1
guitar pedal.
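The link between IIR structures and recurrent neural networks can be made concrete by unrolling a biquad's difference equation as a recurrence that autograd can traverse. This is a generic PyTorch sketch, not one of the paper's three topologies, and a sample-by-sample Python loop is far too slow for production use:

```python
import torch

def df2_biquad(x, b, a):
    """Direct Form II biquad unrolled as a recurrence, akin to an RNN cell.

    x: (T,) input signal; b, a: (3,) coefficients with a[0] == 1 assumed.
    Autograd backpropagates through the loop, yielding gradients for b and a.
    """
    v1 = x.new_zeros(())           # filter state variables
    v2 = x.new_zeros(())
    y = []
    for n in range(x.shape[0]):
        v = x[n] - a[1] * v1 - a[2] * v2            # recursive (feedback) part
        y.append(b[0] * v + b[1] * v1 + b[2] * v2)  # feed-forward part
        v1, v2 = v, v1                               # shift the state
    return torch.stack(y)
```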
Recognizing Guitar Effects and Their Parameter Settings

Guitar effects are commonly used in popular music to shape the
guitar sound to fit specific genres or to create more variety within
musical compositions. The sound is not only determined by the
choice of the guitar effect, but also heavily depends on the parameter settings of the effect. This paper introduces a method to
estimate the parameter settings of guitar effects, which makes it
possible to reconstruct the effect and its settings from an audio
recording of a guitar. The method utilizes audio feature extraction and shallow neural networks, which are trained on data created specifically for this task. The results show that the method
is generally suited to the task, with average estimation errors between ±5% and ±16% of the respective parameter scales, and that it could potentially perform near the level of a human expert.
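As a rough, hypothetical sketch of such a pipeline (hand-crafted audio features feeding a shallow network that regresses normalized parameter values), assuming scikit-learn and a placeholder feature extractor:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def extract_features(audio):
    """Placeholder for the paper's audio feature extraction stage."""
    spectrum = np.abs(np.fft.rfft(audio))
    return spectrum / (spectrum.max() + 1e-9)

# X: feature vectors of effected guitar recordings,
# y: effect parameter settings normalized to [0, 1], so prediction errors
# read directly as a percentage of the parameter scale.
model = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000)  # shallow network
# model.fit(X, y); model.predict(extract_features(recording)[None, :])
```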
Diet Deep Generative Audio Models With Structured Lottery

Deep learning models have provided extremely successful solutions in most audio application fields. However, the high accuracy
of these models comes at the expense of a tremendous computation cost. This aspect is almost always overlooked in evaluating the
quality of proposed models. However, models should not be evaluated without taking into account their complexity. This aspect
is especially critical in audio applications, which heavily rely on
specialized embedded hardware with real-time constraints.
In this paper, we build on recent observations that deep models are highly overparameterized, by studying the lottery ticket hypothesis on deep generative audio models. This hypothesis states
that extremely efficient small sub-networks exist in deep models
and would provide higher accuracy than larger models if trained in
isolation. However, lottery tickets are found by relying on unstructured masking, which means that resulting models do not provide
any gain in either disk size or inference time. Instead, we develop
here a method aimed at performing structured trimming. We show
that this requires relying on global selection, and we introduce a specific criterion based on mutual information.
First, we confirm the surprising result that smaller models provide higher accuracy than their large counterparts. We further
show that we can remove up to 95% of the model weights without significant degradation in accuracy. Hence, we can obtain very
light models for generative audio across popular methods such as
Wavenet, SING or DDSP, that are up to 100 times smaller with
commensurate accuracy. We study the theoretical bounds for embedding these models on Raspberry Pi and Arduino, and show that
we can obtain generative models on CPU with equivalent quality
as large GPU models. Finally, we discuss the possibility of implementing deep generative audio models on embedded platforms.
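A minimal sketch of structured trimming with global selection, using each unit's weight norm as a stand-in saliency for the mutual-information criterion proposed in the paper; zeroing whole output units, unlike unstructured lottery-ticket masks, corresponds to rows and channels that can later be removed outright:

```python
import torch
import torch.nn as nn

def structured_trim(model, keep_ratio=0.05):
    """Zero entire output units, globally ranked across all layers.

    Saliency is each unit's weight norm here, a simple stand-in for the
    paper's mutual-information criterion. Whole zeroed units can later be
    removed outright, reducing both disk size and inference time.
    """
    scores, slots = [], []
    for m in model.modules():
        if isinstance(m, (nn.Linear, nn.Conv1d)):
            norms = m.weight.detach().flatten(1).norm(dim=1)
            scores.append(norms)
            slots.extend((m, i) for i in range(norms.numel()))
    ranking = torch.cat(scores).argsort(descending=True).tolist()
    keep = int(keep_ratio * len(slots))        # e.g. keep 5% of all units
    for idx in ranking[keep:]:
        module, unit = slots[idx]
        module.weight.data[unit].zero_()
        if module.bias is not None:
            module.bias.data[unit] = 0.0
```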
Adversarial Synthesis of Drum Sounds

Recent advancements in generative audio synthesis have allowed for the development of creative tools for generation and
manipulation of audio. In this paper, a strategy is proposed for the
synthesis of drum sounds using generative adversarial networks
(GANs). The system is based on a conditional Wasserstein GAN,
which learns the underlying probability distribution of a dataset
compiled from labeled drum sounds. Labels are used to condition
the system on an integer value that can be used to generate audio
with the desired characteristics. Synthesis is controlled by an input
latent vector that enables continuous exploration and interpolation
of generated waveforms. Additionally, we experiment with a training method that progressively learns to generate audio at different
temporal resolutions. We present our results and discuss the benefits of generating audio with GANs along with sound examples
and demonstrations.
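A sketch of the conditioning mechanism described above, with hypothetical dimensions: the integer label is embedded and concatenated with the latent vector before entering the generator, and moving through the latent space interpolates between generated waveforms.

```python
import torch
import torch.nn as nn

class ConditionedLatent(nn.Module):
    """Combine a latent vector with an integer class label (hypothetical sizes)."""
    def __init__(self, n_classes=10, embed_dim=16):
        super().__init__()
        self.embed = nn.Embedding(n_classes, embed_dim)  # label -> dense vector

    def forward(self, z, labels):
        # The generator consumes this concatenation; walking z along a line
        # between two points interpolates between the generated waveforms.
        return torch.cat([z, self.embed(labels)], dim=-1)

cond = ConditionedLatent()
z = torch.randn(8, 100)               # input latent vectors
labels = torch.randint(0, 10, (8,))   # desired drum-type labels
gen_input = cond(z, labels)           # shape (8, 116)
```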
Relative Music Loudness Estimation Using Temporal Convolutional Networks and a CNN Feature Extraction Front-End

Relative music loudness estimation is a MIR task that consists of dividing audio into segments of three classes: Foreground Music,
Background Music and No Music. Given the temporal correlation
of music, in this work we approach the task using a type of network
with the ability to model temporal context: the Temporal Convolutional Network (TCN). We propose two architectures: a TCN,
and a novel architecture resulting from the combination of a TCN
with a Convolutional Neural Network (CNN) front-end. We name
this new architecture CNN-TCN. We expect the CNN front-end to
work as a feature extraction strategy to achieve a more efficient usage of the network’s parameters. We use the OpenBMAT dataset
to train and test 40 TCN and 80 CNN-TCN models with two grid
searches over a set of hyper-parameters. We compare our models with the two best algorithms submitted to the tasks of music
detection and relative music loudness estimation in MIREX 2019.
All our models outperform the MIREX algorithms, even with fewer parameters. The CNN-TCN emerges as the
best architecture as all its models outperform all TCN models. We
show that adding a CNN front-end to a TCN can actually reduce
the number of parameters of the network while improving performance. The CNN front-end effectively works as a feature extractor, producing consistent patterns that identify different combinations of music and non-music sounds, and it also yields a smoother output than the TCN models.
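To illustrate how a TCN models temporal context, here is a generic residual block with dilated 1-D convolutions in PyTorch; the layer sizes are placeholders, not the paper's configuration:

```python
import torch
import torch.nn as nn

class TCNBlock(nn.Module):
    """One residual block of a TCN: a dilated 1-D convolution over time.

    Stacking blocks with dilations 1, 2, 4, ... grows the receptive field
    exponentially, which is how the network models long temporal context.
    """
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        pad = (kernel_size - 1) // 2 * dilation     # keep length unchanged
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=pad, dilation=dilation)
        self.act = nn.ReLU()

    def forward(self, x):                           # x: (batch, channels, time)
        return x + self.act(self.conv(x))           # residual connection

# A CNN front-end would map e.g. spectrogram frames to `channels` features
# before this stack, as in the CNN-TCN architecture described above.
tcn = nn.Sequential(*[TCNBlock(32, dilation=2 ** i) for i in range(6)])
```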
Blind Arbitrary Reverb Matching

Reverb provides psychoacoustic cues that convey information concerning relative locations within an acoustical space. The need
arises often in audio production to impart an acoustic context on an
audio track that resembles a reference track. One way to make audio tracks appear to have been recorded in the same space is to apply reverb to a dry track that is similar to the reverb in a wet one. This
paper presents a model for the task of “reverb matching,” where
we attempt to automatically add artificial reverb to a track, making
it sound like it was recorded in the same space as a reference track.
We propose a model architecture for performing reverb matching
and provide subjective experimental results suggesting that the reverb matching model can perform as well as a human. We also
provide open source software for generating training data using an
arbitrary Virtual Studio Technology plug-in.
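The paper provides its own open-source tool for this; purely to illustrate the idea of rendering wet training examples through an arbitrary plug-in, here is a sketch using Spotify's pedalboard library with a hypothetical plug-in path:

```python
from pedalboard import load_plugin
from pedalboard.io import AudioFile

# Hypothetical path; any VST3 reverb plug-in would do.
reverb = load_plugin("/path/to/SomeReverb.vst3")

with AudioFile("dry_guitar.wav") as f:   # a dry source recording
    dry = f.read(f.frames)
    sr = f.samplerate

# Randomizing the plug-in's parameters between renders covers the space of
# reverbs; each (dry, wet, parameter-setting) triple is one training example.
wet = reverb(dry, sr)
```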
Optimization of Cascaded Parametric Peak and Shelving Filters With Backpropagation Algorithm

Peak and shelving filters are parametric infinite impulse response
filters which are used for amplifying or attenuating a certain frequency band. Shelving filters are parametrized by their cut-off frequency and gain, and peak filters by center frequency, bandwidth
and gain. Such filters can be cascaded to perform audio processing tasks like equalization, spectral shaping and modelling of complex transfer functions. A cascade of this kind allows independent optimization of these parameters for each filter. For this purpose, a novel approach is proposed for deriving
the necessary local gradients with respect to the control parameters and for applying the instantaneous backpropagation algorithm
to deduce the gradient flow through a cascaded structure. Additionally, the performance of a filter cascade adapted with the proposed method is demonstrated on head-related transfer function modelling as an example application.
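To make the setup concrete: a peak filter's biquad coefficients are closed-form functions of its center frequency, bandwidth (expressed here as Q) and gain, as in the familiar Audio EQ Cookbook. Writing them with differentiable operations, as sketched below in PyTorch, yields local gradients of the kind the paper derives analytically; this is our illustration, not the paper's derivation.

```python
import torch

def peak_filter_coeffs(fc, q, gain_db, fs=48000.0):
    """Biquad coefficients of a parametric peak filter (EQ Cookbook form).

    fc, q, gain_db: scalar tensors with requires_grad=True. Using torch ops
    only means autograd yields d(coeffs)/d(fc, q, gain_db), the local
    gradients needed to backpropagate through a filter cascade.
    """
    A = 10.0 ** (gain_db / 40.0)
    w0 = 2.0 * torch.pi * fc / fs
    alpha = torch.sin(w0) / (2.0 * q)
    b = torch.stack([1 + alpha * A, -2 * torch.cos(w0), 1 - alpha * A])
    a = torch.stack([1 + alpha / A, -2 * torch.cos(w0), 1 - alpha / A])
    return b / a[0], a / a[0]       # normalize so that a0 == 1
```

Cascading such sections and comparing the modelled magnitude response with a target transfer function then gives a loss whose gradients flow back to every fc, q and gain_db in the cascade.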
Audio Morphing Using Matrix Decomposition and Optimal Transport

This paper presents a system for morphing between audio recordings in a continuous parameter space.
The proposed approach
combines matrix decompositions used for audio source separation with displacement interpolation enabled by 1D optimal transport. By interpolating the spectral components obtained using nonnegative matrix factorization of the source and target signals, the
system allows varying the timbre of a sound in real time, while
maintaining its temporal structure. Using harmonic/percussive
source separation as a pre-processing step, the system affords more
detailed control of the interpolation in perceptually meaningful dimensions.
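A compact sketch of 1-D displacement interpolation, the ingredient that moves spectral mass along the frequency axis instead of cross-fading it. This NumPy illustration interpolates the quantile functions of two normalized spectral columns; the actual system applies this to the NMF components of the source and target:

```python
import numpy as np

def displacement_interpolate(p, q, t):
    """1-D optimal-transport (displacement) interpolation of two spectra.

    p, q: nonnegative spectral columns of equal length (e.g. NMF bases),
    assumed to have nonzero sums. Interpolating inverse CDFs shifts mass
    along the frequency axis, so a peak moves rather than cross-fades.
    """
    grid = np.arange(len(p), dtype=float)
    P = np.cumsum(p / p.sum())                 # CDFs of the two distributions
    Q = np.cumsum(q / q.sum())
    levels = np.linspace(0.0, 1.0, len(p))
    inv_p = np.interp(levels, P, grid)         # inverse CDFs (quantiles)
    inv_q = np.interp(levels, Q, grid)
    inv_t = (1.0 - t) * inv_p + t * inv_q      # interpolate quantile functions
    # Recover a spectrum on the original grid from the interpolated quantiles.
    hist, _ = np.histogram(inv_t, bins=len(p), range=(grid[0], grid[-1]),
                           weights=np.full(len(p), 1.0 / len(p)))
    return hist
```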