Differentiable Time–Frequency Scattering on GPU
Joint time–frequency scattering (JTFS) is a convolutional operator in the time–frequency domain which extracts spectrotemporal modulations at various rates and scales. It offers an idealized model of spectrotemporal receptive fields (STRF) in the primary auditory cortex, and thus may serve as a biologically plausible surrogate for human perceptual judgments at the scale of isolated audio events. Yet, prior implementations of JTFS and STRF have remained outside of the standard toolkit of perceptual similarity measures and evaluation methods for audio generation. We trace this issue to three limitations: differentiability, speed, and flexibility. In this paper, we present an implementation of time–frequency scattering in Python. Unlike prior implementations, ours accommodates NumPy, PyTorch, and TensorFlow as backends and is thus portable to both CPU and GPU. We demonstrate the usefulness of JTFS via three applications: unsupervised manifold learning of spectrotemporal modulations, supervised classification of musical instruments, and texture resynthesis of bioacoustic sounds.
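To make the differentiability claim concrete, here is a minimal PyTorch sketch of a scattering-style pipeline (filterbank, complex modulus, low-pass averaging) through which gradients flow back to the waveform. It shows only first-order time scattering; the filter design and all names are illustrative assumptions, not the paper's JTFS implementation.

```python
# Minimal sketch of a differentiable scattering-style transform in PyTorch.
# Illustrates how a filterbank-modulus-averaging pipeline stays
# differentiable end to end; NOT the paper's JTFS implementation.
import torch

def toy_filter_bank(n_filters, length, device="cpu"):
    """Crude bank of band-pass filters defined in the frequency domain."""
    centers = torch.linspace(0.05, 0.45, n_filters, device=device)
    bins = torch.fft.rfftfreq(length, d=1.0, device=device)
    # Gaussian bumps around each center frequency (hypothetical design).
    return torch.exp(-((bins[None, :] - centers[:, None]) ** 2) / (2 * 0.01 ** 2))

def scattering_like(x, filters):
    """First-order scattering: |x * psi| followed by global averaging."""
    X = torch.fft.rfft(x)
    U = torch.fft.irfft(X[None, :] * filters, n=x.shape[-1]).abs()
    return U.mean(dim=-1)  # low-pass averaging -> time-shift invariance

x = torch.randn(2 ** 12, requires_grad=True)
filters = toy_filter_bank(16, x.shape[-1])
S = scattering_like(x, filters)
S.sum().backward()              # gradients reach the waveform, so S can
print(x.grad.abs().max() > 0)   # drive resynthesis by gradient descent
```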
A Study of Control Methods for Percussive Sound Synthesis Based on GANs
The process of creating drum sounds has seen significant evolution in the past decades. The development of analogue drum synthesizers, such as the TR-808, and of modern sound design tools in Digital Audio Workstations led to a variety of drum timbres that defined entire musical genres. Recently, drum synthesis research has been revived with a new focus on training generative neural networks to create drum sounds. Different interfaces have previously been proposed to control the generative process, from low-level latent space navigation to high-level semantic feature parameterisation, but no comprehensive analysis has been presented to evaluate how each approach relates to the creative process. We aim to evaluate how different interfaces support creative control over drum generation by conducting a user study based on the Creative Support Index. We experiment with both a supervised method that decodes semantic latent space directions and an unsupervised Closed-Form Factorization approach from the computer vision literature to parameterise the generation process, and demonstrate that the latter is the preferred means to control a drum synthesizer based on the StyleGAN2 network architecture.
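For readers unfamiliar with the unsupervised approach, closed-form factorization (SeFa, Shen & Zhou 2021) takes the leading right singular vectors of a style-modulation weight matrix as candidate semantic directions, with no training or labels. A hedged sketch, with a random placeholder weight standing in for StyleGAN2's trained first modulated layer:

```python
# Illustrative sketch of closed-form latent factorization (after SeFa):
# interpretable directions are the leading right singular vectors of the
# first style-modulation weight matrix A. `A` here is a random stand-in;
# in practice it comes from the trained StyleGAN2 generator.
import torch

A = torch.randn(512, 512)            # placeholder for the learned weight
_, _, Vh = torch.linalg.svd(A)       # right singular vectors of A
directions = Vh[:5]                  # top-5 candidate semantic directions

w = torch.randn(1, 512)              # a latent code in W space
alpha = 3.0                          # user-facing control amount
w_edit = w + alpha * directions[0]   # move along one discovered direction
```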
HD-AD: A New Approach to Audio Atomic Decomposition with Hyperdimensional Computing
In this paper, we approach the problem of atomic decomposition of audio at the symbolic level of atom parameters through the lens of hyperdimensional computing (HDC), a non-traditional computing paradigm. Existing atomic decomposition algorithms often operate on waveforms from a redundant dictionary of atoms, which makes them increasingly memory- and compute-intensive as the signal length grows and/or the atoms become more complicated. We systematically build an atom encoding using vector function architecture (VFA), a branch of HDC. We train a neural network encoder on synthetic audio signals to generate these encodings and observe that the network generalizes to real recordings. This system, which we call Hyperdimensional Atomic Decomposition (HD-AD), avoids time-domain correlations altogether. Because HD-AD scales with the sparsity of the signal rather than its length in time, it often produces atomic decompositions much faster than real time.
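One standard VFA primitive that such an atom encoding could build on is fractional power encoding: a continuous parameter is represented by raising a fixed random phasor hypervector to that parameter's power, and different parameters are bound by element-wise multiplication. The sketch below is illustrative, not the paper's encoder; the dimensionality and parameter names are assumptions.

```python
# Sketch of fractional power encoding (FPE), a standard VFA primitive that
# HD-AD-style systems can use to embed continuous atom parameters.
import numpy as np

rng = np.random.default_rng(0)
D = 4096                                     # hypervector dimensionality

def base_vector():
    """Random unit phasor vector; its fractional powers encode scalars."""
    return np.exp(1j * rng.uniform(-np.pi, np.pi, D))

z_freq, z_time = base_vector(), base_vector()

def encode_atom(freq, time):
    # z ** x rotates each phase by x * angle(z); binding is the product
    return (z_freq ** freq) * (z_time ** time)

a = encode_atom(0.30, 0.7)
b = encode_atom(0.31, 0.7)
c = encode_atom(0.80, 0.1)
sim = lambda u, v: np.abs(np.vdot(u, v)) / D   # normalized similarity
print(sim(a, b), sim(a, c))   # nearby parameters -> higher similarity
```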
Neural Music Instrument Cloning From Few Samples
Neural music instrument cloning uses deep neural networks to imitate the timbre of a particular music instrument recording. One can create such clones using an approach such as DDSP [1], which has been shown to achieve good synthesis quality for several instrument types [2]. However, this approach needs about ten minutes of audio data from the instrument of interest (the target recording). In this work, we modify the DDSP architecture and apply transfer learning techniques used in speech voice cloning [3] to significantly reduce the amount of target recording audio required. We compare various cloning approaches and architectures across durations of target recording audio ranging from 4 to 256 seconds. We demonstrate editing of loudness and pitch, as well as timbre transfer, from only 16 seconds of target recording audio. Our code and many audio examples are available online.
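The transfer-learning side of such a recipe typically amounts to freezing most of a synthesis network pre-trained on many instruments and adapting only a small set of parameters on the short target recording. A hypothetical PyTorch sketch of that mechanism; the module shapes, the GRU stand-in, and the embedding-only fine-tuning choice are assumptions, not the paper's exact architecture.

```python
# Hypothetical sketch of parameter-freezing for few-shot cloning. The GRU
# and linear head stand in for a pre-trained DDSP-style decoder mapping
# (f0, loudness, embedding) frames to synthesizer controls.
import torch
import torch.nn as nn

decoder = nn.GRU(input_size=2 + 64, hidden_size=256, batch_first=True)
head = nn.Linear(256, 101)               # e.g. harmonic amps + noise bands
instrument_embed = nn.Parameter(torch.zeros(64))  # the only adapted weights

for module in (decoder, head):           # freeze the pre-trained parts
    for p in module.parameters():
        p.requires_grad = False

opt = torch.optim.Adam([instrument_embed], lr=1e-3)
f0_loud = torch.randn(1, 250, 2)         # placeholder (f0, loudness) frames
feats = torch.cat([f0_loud, instrument_embed.expand(1, 250, 64)], dim=-1)
controls, _ = decoder(feats)
controls = head(controls)                # would drive the synthesizer and be
# trained against a spectral loss on seconds of target audio
```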
Physical Modeling Using Recurrent Neural Networks with Fast Convolutional Layers
Discrete-time modeling of acoustic, mechanical, and electrical systems is a prominent topic in the musical signal processing literature. Such models are mostly derived by discretizing a mathematical model, given in terms of ordinary or partial differential equations, using established techniques. Recent work has applied machine-learning techniques to construct such models automatically from data for systems whose lumped states are described by scalar values, such as electrical circuits. In this work, we examine how similar techniques can construct models of systems with spatially distributed rather than lumped states. We describe several novel recurrent neural network structures and show how they can be thought of as an extension of modal techniques. As a proof of concept, we generate synthetic data for three physical systems and show that the proposed network structures can be trained on this data to reproduce the behavior of these systems.
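The connection to modal techniques can be made concrete: a linear recurrence with a diagonal complex transition matrix is exactly a bank of damped oscillators whose decay rates and frequencies can be learned from data. A sketch of that view follows; the parameterization and mode count are illustrative, not the paper's structures.

```python
# Sketch of the "RNN as modal model" view: a linear recurrence with a
# diagonal complex transition matrix is a bank of damped modes.
import torch
import torch.nn as nn

class ModalRecurrence(nn.Module):
    def __init__(self, n_modes=32):
        super().__init__()
        self.log_decay = nn.Parameter(torch.full((n_modes,), -4.0))
        self.freq = nn.Parameter(torch.rand(n_modes) * 3.0)
        self.gain = nn.Parameter(torch.randn(n_modes, dtype=torch.cfloat) * 0.1)

    def forward(self, excitation):          # excitation: (time,)
        lam = torch.exp(-torch.exp(self.log_decay) + 1j * self.freq)  # poles
        h = torch.zeros_like(self.gain)
        out = []
        for x_t in excitation:              # one complex pole per mode
            h = lam * h + x_t
            out.append((self.gain * h).real.sum())
        return torch.stack(out)

y = ModalRecurrence()(torch.randn(256))     # excitation in, modal ringing out
```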
A Direct Microdynamics Adjusting Processor with Matching Paradigm and Differentiable Implementation
In this paper, we propose a new processor capable of directly changing the microdynamics of an audio signal, primarily via a single dedicated user-facing parameter. The novelty of our processor is that it has built into it a measure of relative level, a short-term signal strength measurement that is robust to changes in signal macrodynamics. The resulting dynamic range processing is independent of absolute signal level and directly alters the observed relative level measurements. The inclusion of such a meter within the proposed processor also gives rise to a natural solution to the dynamics matching problem, in which we transfer the microdynamic characteristics of one audio recording to another by estimating appropriate settings for the processor. We suggest a means of providing a reasonable initial guess for the processor settings, followed by an efficient iterative algorithm to refine these estimates. Additionally, we implement the processor as a differentiable recurrent layer and show its effectiveness within a gradient descent optimization loop in a deep learning framework. Moreover, we illustrate that the proposed processor has more favorable gradient characteristics than a conventional dynamic range compressor. Throughout, we consider extensions of the processor, the matching algorithm, and the differentiable implementation to the multiband case.
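As a rough illustration of the differentiable-recurrent-layer idea, the sketch below implements a one-pole smoothed level meter referenced to the long-term level (so it is robust to macrodynamic changes) and applies a gain law driven by a single parameter. Both the meter constants and the gain law are assumptions, not the paper's processor.

```python
# Minimal sketch of a differentiable short-term relative-level meter: a
# one-pole envelope follower written as a recurrence, so gradients flow to
# both the signal and the user parameter. The gain law is illustrative.
import torch

def relative_level_db(x, alpha=0.99, eps=1e-7):
    """One-pole smoothed power, in dB relative to the long-term level."""
    p = torch.zeros(())
    env = []
    for x_t in x:
        p = alpha * p + (1 - alpha) * x_t ** 2
        env.append(p)
    short = 10 * torch.log10(torch.stack(env) + eps)
    long_term = 10 * torch.log10(x.pow(2).mean() + eps)
    return short - long_term          # robust to overall level changes

x = torch.randn(1024, requires_grad=True)
strength = torch.tensor(0.5, requires_grad=True)      # user-facing parameter
y = x * 10 ** (strength * relative_level_db(x) / 20)  # expand microdynamics
y.pow(2).mean().backward()            # differentiable end to end
```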
Model Bending: Teaching Circuit Models New Tricks
A technique called model bending is introduced for generating novel signal processing systems grounded in analog electronic circuits. By applying the ideas behind circuit bending to models of nonlinear analog circuits, it is possible to create novel nonlinear signal processors which mimic the behavior of analog electronics but which are not possible to implement in the analog realm. The history of both circuit bending and circuit modeling is discussed, as well as a theoretical basis for how these approaches can complement each other. Potential pitfalls in the practical application of model bending are highlighted, and suggested solutions to those problems are provided, with examples.
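A toy example of the bending idea, assuming nothing about the paper's specific models: keep a discretized circuit model's structure intact but substitute its nonlinear element with one no analog component provides.

```python
# Toy illustration of model bending: a generic one-pole RC stage followed by
# a nonlinear shaper, with the shaper swapped out. Not a model from the paper.
import numpy as np

def clipper(x, shaper, fs=48000, rc=1e-3):
    a = 1.0 / (1.0 + rc * fs)       # one-pole smoothing coefficient
    y, state = np.zeros_like(x), 0.0
    for n, xn in enumerate(x):
        state += a * (xn - state)   # RC stage (linear, unchanged)
        y[n] = shaper(state)        # nonlinear element (the bent part)
    return y

t = np.arange(4800) / 48000
x = np.sin(2 * np.pi * 110 * t)
y_stock = clipper(x, np.tanh)                  # diode-like "faithful" model
y_bent = clipper(x, lambda v: np.sin(8 * v))   # bent: impossible in analog
```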
Neural Net Tube Models for Wave Digital Filters
Herein, we demonstrate the use of neural networks to simulate multiport nonlinearities inside a wave digital filter. We introduce a resolved wave definition which allows us to extract features from a Kirchhoff-domain dataset and train our neural networks directly in the wave domain. A hyperparameter search is performed to minimize error and runtime complexity. To illustrate the method, we model a tube amplifier circuit inspired by the preamplifier stage of the Fender Pro-Junior guitar amplifier. We analyze the performance of our neural network models by comparing their distortion characteristics and transconductances. Our results suggest that the choice of activation function has a significant effect on the distortion characteristic produced by the neural network.
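For orientation, the generic voltage-wave definitions a = v + R*i and b = v - R*i already suffice to sketch the training setup: convert Kirchhoff-domain measurements to waves, then fit a small network mapping incident to reflected waves. The paper's resolved wave definition differs from this generic one, and the toy one-port, port resistance, and network size below are placeholders.

```python
# Sketch of wave-domain training: Kirchhoff data (v, i) is converted to
# incident/reflected waves and a small MLP is fit as b = f(a). The tanh
# one-port is a toy stand-in for the tube stage.
import torch
import torch.nn as nn

R = 1e3                                    # port resistance (ohms), assumed
v = torch.linspace(-1.0, 1.0, 1024).unsqueeze(1)   # port voltage sweep
i = 1e-3 * torch.tanh(v / 0.5)             # toy nonlinear one-port law
a, b = v + R * i, v - R * i                # incident / reflected waves

net = nn.Sequential(nn.Linear(1, 16), nn.Tanh(), nn.Linear(16, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)
for _ in range(500):                       # fit the reflection directly
    opt.zero_grad()
    loss = nn.functional.mse_loss(net(a), b)
    loss.backward()
    opt.step()
```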
Realistic Gramophone Noise Synthesis Using a Diffusion Model
This paper introduces a novel data-driven strategy for synthesizing gramophone noise audio textures. A diffusion probabilistic model is applied to generate highly realistic quasiperiodic noises. The proposed model is designed to generate samples of length equal to one disk revolution, but a method to generate plausible periodic variations between revolutions is also proposed. A guided approach is also applied as a conditioning method, in which an audio signal generated with manually tuned signal processing is refined via reverse diffusion to improve realism. The method has been evaluated in a subjective listening test, in which the participants were often unable to distinguish the synthesized signals from the real ones. The synthetic noises produced with the best proposed unconditional method are statistically indistinguishable from real noise recordings. This work shows the potential of diffusion models for highly realistic audio synthesis tasks.
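The guided approach can be sketched in the spirit of SDEdit-style refinement: diffuse the hand-tuned DSP draft to an intermediate noise level, then run the learned reverse process from there. The schedule and the `denoise_step` placeholder below are assumptions, not the paper's trained model.

```python
# Sketch of guided refinement: corrupt a DSP-generated draft to an
# intermediate diffusion step, then partially reverse-diffuse. The linear
# beta schedule is illustrative; `denoise_step` stands in for the model.
import torch

T = 1000
betas = torch.linspace(1e-4, 2e-2, T)
alphas_bar = torch.cumprod(1 - betas, dim=0)

def refine(draft, denoise_step, t_start=300):
    ab = alphas_bar[t_start]
    x = ab.sqrt() * draft + (1 - ab).sqrt() * torch.randn_like(draft)
    for t in range(t_start, -1, -1):     # partial reverse diffusion
        x = denoise_step(x, t)           # placeholder trained model
    return x                             # realism from the model,
                                         # structure from the DSP draft

draft = torch.randn(44100)               # stand-in for the hand-tuned noise
refined = refine(draft, lambda x, t: x)  # identity stands in for the model
```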
Joint Estimation of Fader and Equalizer Gains of DJ Mixers Using Convex Optimization
Disc jockeys (DJs) use audio effects to make smooth transitions from one song to another. There have been attempts to computationally analyze the creative process of seamless mixing, but only a few studies have estimated the fader or equalizer (EQ) gains controlled by DJs. In this study, we propose a method that jointly estimates time-varying fader and EQ gains so as to reproduce a mix from its individual source tracks. The method approximates each equalizer filter with a linear combination of a fixed equalizer filter and a constant gain, which converts the joint estimation into a convex optimization problem. For the experiment, we collected a new DJ mix dataset that consists of 5,040 real-world DJ mixes with 50,742 transitions, and evaluated the proposed method with a mix reconstruction error. The results show that the proposed method estimates the time-varying fader and equalizer gains more accurately than existing methods and simple baselines.
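To see why the approximation makes the problem convex: once the filtered track is fixed, the unknown gains enter the reconstruction linearly, so fitting them to the observed mix is a bounded least-squares problem. A toy sketch, using cvxpy as one possible solver; the signals, filter, and sizes are placeholders, not the paper's formulation.

```python
# Toy sketch of the convex formulation: with the EQ response approximated as
# a fixed filtered track plus a broadband term, the gains are fit by
# bounded least squares over one short segment.
import cvxpy as cp
import numpy as np

N = 2048
track = np.random.randn(N)            # one source track (toy)
track_lp = np.convolve(track, np.ones(32) / 32, mode="same")  # fixed EQ filter
mix = 0.7 * track + 0.2 * track_lp    # observed mix segment (toy)

g = cp.Variable(2)                    # [broadband gain, EQ branch gain]
recon = g[0] * track + g[1] * track_lp
prob = cp.Problem(cp.Minimize(cp.sum_squares(mix - recon)),
                  [g >= 0, g <= 1])
prob.solve()
print(g.value)                        # recovers roughly [0.7, 0.2]
```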