Differentiable IIR Filters for Machine Learning Applications

In this paper we present an approach to using traditional digital IIR
filter structures inside deep-learning networks trained using backpropagation. We establish the link between such structures and
recurrent neural networks. Three different differentiable IIR filter
topologies are presented and compared against each other and an
established baseline. Additionally, a simple Wiener-Hammerstein
model using differentiable IIRs as its filtering component is presented and trained on a guitar signal played through a Boss DS-1
guitar pedal.
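The recurrence behind such a filter makes the RNN connection concrete. Below is a minimal sketch (not one of the paper's three topologies) of a first-order IIR implemented as a PyTorch module: the difference equation is unrolled sample by sample, exactly like an RNN with a linear state transition, so the coefficients receive gradients via backpropagation through time.

```python
import torch

class DifferentiableIIR(torch.nn.Module):
    """First-order IIR, y[n] = b0*x[n] + b1*x[n-1] - a1*y[n-1],
    with learnable coefficients, unrolled like an RNN."""
    def __init__(self):
        super().__init__()
        self.b0 = torch.nn.Parameter(torch.tensor(0.5))
        self.b1 = torch.nn.Parameter(torch.tensor(0.0))
        self.a1 = torch.nn.Parameter(torch.tensor(-0.1))

    def forward(self, x):              # x: (num_samples,)
        x_prev = x.new_zeros(())       # zero initial conditions
        y_prev = x.new_zeros(())
        out = []
        for x_n in x:                  # sample-wise recurrence (BPTT)
            y_n = self.b0 * x_n + self.b1 * x_prev - self.a1 * y_prev
            out.append(y_n)
            x_prev, y_prev = x_n, y_n
        return torch.stack(out)

# flt = DifferentiableIIR()
# opt = torch.optim.Adam(flt.parameters(), lr=1e-2)
# loss = torch.mean((flt(x) - target) ** 2); loss.backward(); opt.step()
```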
Recognizing Guitar Effects and Their Parameter Settings

Guitar effects are commonly used in popular music to shape the
guitar sound to fit specific genres or to create more variety within
musical compositions. The sound is not only determined by the
choice of the guitar effect, but also heavily depends on the parameter settings of the effect. This paper introduces a method to
estimate the parameter settings of guitar effects, which makes it
possible to reconstruct the effect and its settings from an audio
recording of a guitar. The method utilizes audio feature extraction and shallow neural networks, which are trained on data created specifically for this task. The results show that the method
is generally suited for this task with average estimation errors of
±5% to ±16% of the respective parameter scales, and could potentially
perform near the level of a human expert.
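A minimal sketch of such a pipeline, assuming MFCC-based features and a scikit-learn MLP (the paper's exact feature set, network sizes, and training data are not reproduced here):

```python
import numpy as np
import librosa
from sklearn.neural_network import MLPRegressor

def extract_features(path):
    """Compact spectral summary of one guitar recording (an assumed
    feature set, for illustration only)."""
    y, sr = librosa.load(path, sr=44100, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1),
                           centroid.mean(axis=1)])

# X: one feature row per recording; Y: knob settings scaled to [0, 1]
# model = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000).fit(X, Y)
# settings = model.predict(extract_features("take.wav")[None, :])
```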
Adversarial Synthesis of Drum Sounds

Recent advancements in generative audio synthesis have allowed for the development of creative tools for generation and
manipulation of audio. In this paper, a strategy is proposed for the
synthesis of drum sounds using generative adversarial networks
(GANs). The system is based on a conditional Wasserstein GAN,
which learns the underlying probability distribution of a dataset
compiled of labeled drum sounds. Labels are used to condition
the system on an integer value that can be used to generate audio
with the desired characteristics. Synthesis is controlled by an input
latent vector that enables continuous exploration and interpolation
of generated waveforms. Additionally, we experiment with a training method that progressively learns to generate audio at different
temporal resolutions. We present our results and discuss the benefits of generating audio with GANs along with sound examples
and demonstrations.
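The conditioning mechanism can be sketched as follows: an embedded integer label is concatenated with the latent vector before upsampling to a waveform. This is a minimal stand-in (layer sizes, the Wasserstein critic, and the progressive-resolution schedule are omitted):

```python
import torch
import torch.nn as nn

class DrumGenerator(nn.Module):
    """Label-conditioned waveform generator sketch."""
    def __init__(self, n_classes=3, latent_dim=100):
        super().__init__()
        self.embed = nn.Embedding(n_classes, 16)    # integer label -> vector
        self.net = nn.Sequential(
            nn.Linear(latent_dim + 16, 256 * 16),
            nn.Unflatten(1, (256, 16)),
            nn.ConvTranspose1d(256, 128, 8, stride=4, padding=2), nn.ReLU(),
            nn.ConvTranspose1d(128, 64, 8, stride=4, padding=2), nn.ReLU(),
            nn.ConvTranspose1d(64, 1, 8, stride=4, padding=2), nn.Tanh(),
        )

    def forward(self, z, label):
        h = torch.cat([z, self.embed(label)], dim=1)
        return self.net(h)                          # waveform in [-1, 1]

# z = torch.randn(8, 100); kick = torch.zeros(8, dtype=torch.long)
# audio = DrumGenerator()(z, kick)                  # shape (8, 1, 1024)
```

Interpolating between two latent vectors while holding the label fixed traces a continuous path through the generated waveforms.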
Audio Morphing Using Matrix Decomposition and Optimal Transport

This paper presents a system for morphing between audio recordings in a continuous parameter space.
The proposed approach
combines matrix decompositions used for audio source separation with displacement interpolation enabled by 1D optimal transport. By interpolating the spectral components obtained using nonnegative matrix factorization of the source and target signals, the
system allows varying the timbre of a sound in real time, while
maintaining its temporal structure. Using harmonic/percussive
source separation as a pre-processing step, the system affords more
detailed control of the interpolation in perceptually meaningful dimensions.
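The key displacement-interpolation step has a closed form in 1D: interpolate the inverse CDFs of the two spectra. A sketch under that assumption (W_src, W_tgt, and freqs are hypothetical names for NMF templates and the frequency grid):

```python
import numpy as np
from sklearn.decomposition import NMF

def displacement_interp(p, q, t, grid):
    """1-D optimal-transport interpolation between spectra p and q,
    treated as distributions over the frequency grid."""
    p, q = p / p.sum(), q / q.sum()
    u = np.linspace(0.0, 1.0, len(grid))         # quantile levels
    inv_p = np.interp(u, np.cumsum(p), grid)     # inverse CDF of p
    inv_q = np.interp(u, np.cumsum(q), grid)     # inverse CDF of q
    inv_t = (1 - t) * inv_p + t * inv_q          # displacement interpolation
    hist, _ = np.histogram(inv_t, bins=len(grid),
                           range=(grid[0], grid[-1]))
    return hist / hist.sum()                     # morphed spectrum at t

# Templates come from NMF of the source/target magnitude STFTs:
# W = NMF(n_components=8).fit(S.T).components_   # rows are spectral shapes
# morphed = displacement_interp(W_src[0], W_tgt[0], 0.5, freqs)
```

Unlike linear cross-fading of magnitudes, this moves spectral peaks along the frequency axis, which is what produces a perceptually continuous timbre change.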
Diet Deep Generative Audio Models With Structured Lottery

Deep learning models have provided extremely successful solutions in most audio application fields. However, the high accuracy
of these models comes at the expense of a tremendous computation cost. This aspect is almost always overlooked in evaluating the
quality of proposed models. Yet models should not be evaluated without taking their complexity into account. This aspect
is especially critical in audio applications, which heavily rely on
specialized embedded hardware with real-time constraints.
In this paper, we build on recent observations that deep models are highly overparameterized, by studying the lottery ticket hypothesis on deep generative audio models. This hypothesis states
that extremely efficient small sub-networks exist in deep models
and would provide higher accuracy than larger models if trained in
isolation. However, lottery tickets are found by relying on unstructured masking, which means that resulting models do not provide
any gain in either disk size or inference time. Instead, we develop
here a method aimed at performing structured trimming. We show
that this requires relying on global selection, and we introduce a specific criterion based on mutual information.
First, we confirm the surprising result that smaller models provide higher accuracy than their large counterparts. We further
show that we can remove up to 95% of the model weights without significant degradation in accuracy. Hence, we can obtain very
light models for generative audio across popular methods such as
WaveNet, SING, or DDSP, which are up to 100 times smaller with
commensurate accuracy. We study the theoretical bounds for embedding these models on Raspberry Pi and Arduino, and show that
we can obtain generative models on CPU with equivalent quality
as large GPU models. Finally, we discuss the possibility of implementing deep generative audio models on embedded platforms.
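Global structured selection, in contrast to per-layer unstructured masking, can be sketched as below. The per-channel L1 norm stands in for the paper's mutual-information criterion; it only illustrates how a single global threshold removes whole channels, which is what yields real disk-size and inference-time savings:

```python
import torch
import torch.nn as nn

def structured_trim(model, keep_ratio=0.05):
    """Rank all conv output channels globally and keep the top fraction.
    The L1 score is a stand-in for the paper's mutual-information criterion."""
    scored = []
    for m in model.modules():
        if isinstance(m, nn.Conv1d):
            s = m.weight.detach().abs().sum(dim=(1, 2))  # one score per channel
            scored.append((m, s))
    threshold = torch.cat([s for _, s in scored]).quantile(1 - keep_ratio)
    masks = {}
    for m, s in scored:
        masks[m] = (s >= threshold).float()              # channels to keep
        m.weight.data *= masks[m][:, None, None]         # zero whole filters
        if m.bias is not None:
            m.bias.data *= masks[m]
    return masks  # entire filters are zeroed, so they can be physically removed
```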
Relative Music Loudness Estimation Using Temporal Convolutional Networks and a CNN Feature Extraction Front-End

Relative music loudness estimation is a MIR task that consists of
dividing audio into segments of three classes: Foreground Music,
Background Music and No Music. Given the temporal correlation
of music, in this work we approach the task using a type of network
with the ability to model temporal context: the Temporal Convolutional Network (TCN). We propose two architectures: a TCN,
and a novel architecture resulting from the combination of a TCN
with a Convolutional Neural Network (CNN) front-end. We name
this new architecture CNN-TCN. We expect the CNN front-end to
work as a feature extraction strategy to achieve a more efficient usage of the network’s parameters. We use the OpenBMAT dataset
to train and test 40 TCN and 80 CNN-TCN models with two grid
searches over a set of hyper-parameters. We compare our models with the two best algorithms submitted to the tasks of music
detection and relative music loudness estimation in MIREX 2019.
All our models outperform the MIREX algorithms, even with fewer parameters. The CNN-TCN emerges as the
best architecture as all its models outperform all TCN models. We
show that adding a CNN front-end to a TCN can actually reduce
the number of parameters of the network while improving performance. The CNN front-end effectively works as a feature extractor, producing consistent patterns that identify different combinations of music and non-music sounds, and it also helps produce
a smoother output in comparison to the TCN models.
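A minimal sketch of the CNN-TCN idea (layer counts and sizes are placeholders, not the grid-searched configurations): a small 2-D convolution summarizes the mel-frequency axis, and dilated 1-D convolutions then model long temporal context.

```python
import torch
import torch.nn as nn

class CNNTCN(nn.Module):
    """CNN feature-extraction front-end followed by a dilated-conv TCN."""
    def __init__(self, hidden=32, n_classes=3):
        super().__init__()
        self.frontend = nn.Sequential(       # (B, 1, mels, T) -> (B, hidden, T)
            nn.Conv2d(1, hidden, (5, 3), padding=(2, 1)), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)), # collapse the frequency axis
            nn.Flatten(1, 2),
        )
        self.tcn = nn.Sequential(*[
            nn.Sequential(nn.Conv1d(hidden, hidden, 3, padding=d, dilation=d),
                          nn.ReLU())
            for d in (1, 2, 4, 8)            # growing temporal receptive field
        ])
        self.head = nn.Conv1d(hidden, n_classes, 1)

    def forward(self, mel):                  # mel: (B, 1, n_mels, T)
        return self.head(self.tcn(self.frontend(mel)))  # (B, 3, T) logits
```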
Neural Parametric Equalizer Matching Using Differentiable Biquads

This paper proposes a neural network for carrying out parametric equalizer (EQ) matching. The novelty of this neural network
solution is that it can be optimized directly in the frequency domain by means of differentiable biquads, rather than relying solely
on a loss on parameter values, which does not correlate directly
with the system output. We compare the performance of the proposed neural network approach with that of a baseline algorithm
based on a convex relaxation of the problem. It is observed that the
neural network can provide better matching than the baseline approach because it directly attempts to solve the non-convex problem. Moreover, we show that the same network trained with only
a parameter loss is insufficient for the task, despite the fact that it
matches underlying EQ parameters better than one trained with a
combination of spectral and parameter losses.
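The enabling ingredient is that a biquad's frequency response is a smooth function of its coefficients, so a spectral loss can be backpropagated to the EQ parameters. A minimal sketch of that response computation (the full cascade of EQ bands and the network itself are omitted):

```python
import torch

def biquad_log_mag(b, a, n_bins=512):
    """Differentiable log-magnitude response of one biquad,
    b = (b0, b1, b2), a = (a1, a2), on a linear frequency grid."""
    w = torch.linspace(0.0, torch.pi, n_bins)
    z = torch.exp(-1j * w)                   # e^{-j omega}
    num = b[0] + b[1] * z + b[2] * z ** 2
    den = 1.0 + a[0] * z + a[1] * z ** 2
    return torch.log(num.abs() / den.abs() + 1e-8)

# target: measured log-magnitude; (b, a) predicted by the network
# spectral_loss = torch.mean((biquad_log_mag(b, a) - target) ** 2)
```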
Blind Arbitrary Reverb Matching

Reverb provides psychoacoustic cues that convey information concerning relative locations within an acoustical space. The need
arises often in audio production to impart an acoustic context on an
audio track that resembles a reference track. One way to make audio tracks appear to have been recorded in the same space is to apply reverb to a dry track that matches the reverb in a wet one. This
paper presents a model for the task of “reverb matching,” where
we attempt to automatically add artificial reverb to a track, making
it sound like it was recorded in the same space as a reference track.
We propose a model architecture for performing reverb matching
and provide subjective experimental results suggesting that the reverb matching model can perform as well as a human. We also
provide open source software for generating training data using an
arbitrary Virtual Studio Technology plug-in.
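One plausible reading of the approach, sketched under stated assumptions (the encoder layout and the number of plug-in parameters are guesses, not the paper's specification): a convolutional encoder maps the wet reference's spectrogram to normalized parameter values, which are then rendered through the reverb plug-in on the dry track.

```python
import torch
import torch.nn as nn

class ReverbMatcher(nn.Module):
    """Predicts normalized reverb parameters from a wet reference."""
    def __init__(self, n_params=10):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, n_params), nn.Sigmoid(),   # knob values in [0, 1]
        )

    def forward(self, wet_mel):                      # (B, 1, mels, frames)
        return self.encoder(wet_mel)

# params = ReverbMatcher()(reference_mel)
# Rendering: feed `params` to the reverb VST, then process the dry track.
```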
GPGPU Patterns for Serial and Parallel Audio Effects

Modern commodity GPUs offer high numerical throughput per
unit of cost, but often sit idle during audio workstation tasks. Prior research in the field has shown that GPUs excel at tasks
such as Finite-Difference Time-Domain simulation and wavefield
synthesis. Concrete implementations of several such projects are
available for use.
Benchmarks and use cases generally concentrate on running
one project on a GPU. Running multiple such projects simultaneously is less common, and reduces throughput. In this work
we list some concerns when running multiple heterogeneous tasks
on the GPU. We apply optimization strategies detailed in developer documentation and commercial CUDA literature, and show
results through the lens of real-time audio tasks. We benchmark
the cases of (i) a homogeneous effect chain made of previously
separate effects, and (ii) a synthesizer with distinct, parallelizable
sound generators.
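For case (i), one of the optimizations amounts to kernel fusion: collapsing several per-sample effects into a single launch removes per-effect launch overhead. A hypothetical CuPy sketch of the pattern (the gain/soft-clip chain is an invented example, not one of the paper's benchmarks):

```python
import cupy as cp

# One fused kernel instead of one launch per effect stage.
fused_chain = cp.ElementwiseKernel(
    'float32 x, float32 gain',
    'float32 y',
    '''
    float g = x * gain;   // gain stage
    y = tanhf(g);         // soft-clip stage
    ''',
    'fused_chain')

block = cp.random.standard_normal(512, dtype=cp.float32)  # one audio block
out = fused_chain(block, cp.float32(0.5))
```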