A Segmental Spectro-Temporal Model of Musical Timbre
We propose a new statistical model of musical timbre that handles the different segments of the temporal envelope (attack, sustain and release) separately in order to account for their different spectral and temporal behaviors. The model is based on a reduced-dimensionality representation of the spectro-temporal envelope. Temporal coefficients corresponding to the attack and release segments are subjected to explicit trajectory modeling based on a non-stationary Gaussian process. Coefficients corresponding to the sustain phase are modeled as a multivariate Gaussian. A compound similarity measure associated with the segmental model is proposed and successfully tested in instrument classification experiments. Apart from its use in a statistical framework, the modeling method allows intuitive and informative visualizations of the characteristics of musical timbre.
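As a rough illustration of the sustain-phase component, the sketch below fits a multivariate Gaussian to each note's sustain-segment coefficients and compares two notes with a symmetric Kullback-Leibler divergence. The function names, coefficient dimensionality, and the choice of symmetric KL are illustrative assumptions, not the paper's exact compound measure.

# Sketch of a sustain-phase similarity: summarize each note's sustain
# coefficients with a multivariate Gaussian, then compare notes with a
# symmetric KL divergence. Illustrative only; the paper's compound measure
# also covers the attack and release trajectories.
import numpy as np

def fit_gaussian(frames):
    """frames: (n_frames, n_coeffs) sustain-segment envelope coefficients."""
    mu = frames.mean(axis=0)
    cov = np.cov(frames, rowvar=False) + 1e-6 * np.eye(frames.shape[1])
    return mu, cov

def symmetric_kl(g1, g2):
    """Symmetrized KL divergence between two multivariate Gaussians."""
    (m1, c1), (m2, c2) = g1, g2
    def kl(ma, ca, mb, cb):
        cb_inv = np.linalg.inv(cb)
        diff = mb - ma
        return 0.5 * (np.trace(cb_inv @ ca) + diff @ cb_inv @ diff
                      - ma.size + np.log(np.linalg.det(cb) / np.linalg.det(ca)))
    return kl(m1, c1, m2, c2) + kl(m2, c2, m1, c1)

# Toy usage with random stand-ins for sustain coefficients:
rng = np.random.default_rng(0)
note_a = fit_gaussian(rng.normal(0.0, 1.0, (200, 12)))
note_b = fit_gaussian(rng.normal(0.5, 1.2, (180, 12)))
print(symmetric_kl(note_a, note_b))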
Fully Conditioned and Low-Latency Black-Box Modeling of Analog Compression
Neural networks have been found suitable for virtual analog modeling applications. Several analog audio effects have been successfully modeled with deep learning techniques, using low-latency and conditioned architectures suitable for real-world applications. Challenges remain with effects that present more complex responses, such as nonlinear and time-varying input-output relationships. This paper proposes a deep-learning model for the analog compression effect. The architecture we introduce is fully conditioned by the device control parameters, and it works on small audio segments, allowing low-latency real-time implementations. The architecture is used to model the CL 1B analog optical compressor, showing overall high accuracy and the ability to capture the different attack and release compression profiles. The proposed architecture's ability to model audio compression behaviors is also verified using datasets from other compressors. Limitations remain in heavy compression scenarios determined by the conditioning parameters.
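A minimal sketch of one way such conditioning can be wired: a small causal convolutional network whose features are scaled and shifted (FiLM-style) by the compressor's control parameters. The layer sizes, the FiLM scheme, and all names are assumptions rather than the paper's exact architecture.

# Hypothetical conditioned black-box compressor: causal 1-D convolutions over
# a short audio segment, with control parameters (e.g. threshold, ratio,
# attack, release) injected as per-channel scale and shift.
import torch
import torch.nn as nn

class ConditionedCompressor(nn.Module):
    def __init__(self, channels=16, n_controls=4):
        super().__init__()
        self.inp = nn.Conv1d(1, channels, kernel_size=3, padding=2)
        self.hidden = nn.Conv1d(channels, channels, kernel_size=3,
                                padding=4, dilation=2)
        self.out = nn.Conv1d(channels, 1, kernel_size=1)
        # Map control parameters to per-channel scale and shift (FiLM).
        self.film = nn.Linear(n_controls, 2 * channels)

    def forward(self, x, controls):
        # x: (batch, 1, n_samples) audio segment; controls: (batch, n_controls)
        n = x.shape[-1]
        h = torch.tanh(self.inp(x))[..., :n]          # trim to causal length
        scale, shift = self.film(controls).chunk(2, dim=-1)
        h = h * scale.unsqueeze(-1) + shift.unsqueeze(-1)
        h = torch.tanh(self.hidden(h))[..., :n]
        return self.out(h)

model = ConditionedCompressor()
y = model(torch.randn(2, 1, 512), torch.rand(2, 4))
print(y.shape)  # torch.Size([2, 1, 512])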
Designing a Library for Generative Audio in Unity
This paper overviews URALi, a library designed to add generative sound synthesis capabilities to Unity. The project is directed, in particular, towards audiovisual artists who are keen on working with algorithmic systems in Unity but cannot find native solutions for procedural sound synthesis to pair with their visual and control counterparts. After overviewing the options available in Unity concerning audio, this paper reports on the functioning and architecture of the library, which is an ongoing project.
Neural Parametric Equalizer Matching Using Differentiable Biquads
This paper proposes a neural network for carrying out parametric equalizer (EQ) matching. The novelty of this neural network solution is that it can be optimized directly in the frequency domain by means of differentiable biquads, rather than relying solely on a loss on parameter values, which does not correlate directly with the system output. We compare the performance of the proposed neural network approach with that of a baseline algorithm based on a convex relaxation of the problem. It is observed that the neural network can provide better matching than the baseline approach because it directly attempts to solve the non-convex problem. Moreover, we show that the same network trained with only a parameter loss is insufficient for the task, despite the fact that it matches the underlying EQ parameters better than one trained with a combination of spectral and parameter losses.
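The core mechanism can be sketched as follows: each biquad's complex frequency response is evaluated on a grid directly from its coefficients, sections are cascaded by multiplication, and a log-magnitude loss against the target response is backpropagated. This is a minimal PyTorch sketch under assumed shapes and names, not the paper's implementation.

# Differentiable biquad cascade: the frequency response is an analytic
# function of the coefficients, so an autodiff framework can backpropagate
# a spectral loss through it into a parameter-predicting network.
import torch

def biquad_response(b, a, n_freqs=512):
    """b, a: (n_sections, 3) coefficient tensors; returns the cascade's
    complex response on a linear grid from 0 to pi."""
    w = torch.linspace(0.0, torch.pi, n_freqs)
    z = torch.exp(-1j * w)                       # e^{-j omega}
    zpow = torch.stack([z**0, z, z**2])          # (3, n_freqs)
    num = b.to(torch.complex64) @ zpow           # (n_sections, n_freqs)
    den = a.to(torch.complex64) @ zpow
    return (num / den).prod(dim=0)               # cascade the sections

def log_mag_loss(b, a, target_response):
    pred = biquad_response(b, a, target_response.numel())
    return torch.mean((torch.log(pred.abs() + 1e-8)
                       - torch.log(target_response.abs() + 1e-8)) ** 2)

# Toy check: matching a response against itself gives (near-)zero loss.
b = torch.tensor([[1.0, -0.5, 0.2]], requires_grad=True)
a = torch.tensor([[1.0, -0.3, 0.1]])
target = biquad_response(b.detach(), a)
print(log_mag_loss(b, a, target))  # ~0, and differentiable w.r.t. b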
Informed Selection of Frames for Music Similarity Computation
In this paper we present a new method to compute frame-based audio similarities, based on nearest-neighbour density estimation. We do not recommend it as a practical method for large collections because of its high runtime. Rather, we use this new method for a detailed analysis to gain deeper insight into how a bag-of-frames (BOF) approach determines similarities among songs and, in particular, to identify those audio frames that make two songs similar from a machine's point of view. Our analysis reveals that audio frames of very low energy, which are of course not the most salient with respect to human perception, have a surprisingly big influence on current similarity measures. Based on this observation we propose to remove these low-energy frames before computing song models and show, via classification experiments, that the proposed frame selection strategy improves the audio similarity measure.
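The proposed frame selection reduces to a simple filter before model fitting. The sketch below drops frames whose RMS energy falls below a percentile threshold; the 10th percentile and the use of feature-frame energy (rather than energy measured on the audio frames themselves) are illustrative choices, not the paper's tuned values.

# Drop the lowest-energy frames of a song before fitting its BOF model,
# so near-silent frames stop dominating the similarity computation.
import numpy as np

def select_frames(frames, percentile=10.0):
    """frames: (n_frames, n_dims) feature matrix; returns the frames whose
    RMS energy is above the given percentile threshold."""
    rms = np.sqrt((frames ** 2).mean(axis=1))
    threshold = np.percentile(rms, percentile)
    return frames[rms > threshold]

frames = np.random.randn(1000, 20) * np.random.rand(1000, 1)  # varied energy
print(select_frames(frames).shape)  # roughly 900 frames survive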
Modeling the Impulse Response of Higher-Order Microphone Arrays Using Differentiable Feedback Delay Networks
Recently, differentiable multiple-input multiple-output Feedback Delay Networks (FDNs) have been proposed for modeling target multichannel room impulse responses by optimizing their parameters according to perceptually-driven time-domain descriptors. However, in spatial audio applications, frequency-domain characteristics and inter-channel differences are crucial for accurately replicating a given soundfield. In this article, targeting the modeling of the response of higher-order microphone arrays, we improve on this methodology by optimizing the FDN parameters using a novel spatially-informed loss function, demonstrating its superior performance over previous approaches and paving the way toward the use of differentiable FDNs in spatial audio applications such as soundfield reconstruction and rendering.
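For readers unfamiliar with the system being optimized, the sketch below renders the impulse response of a small FDN with an orthogonal feedback matrix, sample by sample. The delays and gains are arbitrary placeholders; a differentiable implementation would express the same recursion in an autodiff framework so the loss can reach the parameters.

# Minimal single-input single-output FDN impulse response. Each channel owns
# a delay line; the feedback matrix recirculates the delay-line outputs.
import numpy as np

def fdn_impulse_response(delays, feedback, b, c, n_samples=4000):
    """delays: (N,) ints; feedback: (N, N); b, c: (N,) input/output gains."""
    N = len(delays)
    buffers = [np.zeros(d) for d in delays]   # one delay line per channel
    ptrs = [0] * N
    out = np.zeros(n_samples)
    for n in range(n_samples):
        x = 1.0 if n == 0 else 0.0            # unit impulse input
        taps = np.array([buffers[i][ptrs[i]] for i in range(N)])
        out[n] = c @ taps                     # mix delay-line outputs
        fb = feedback @ taps + b * x          # recirculate + inject input
        for i in range(N):
            buffers[i][ptrs[i]] = fb[i]
            ptrs[i] = (ptrs[i] + 1) % delays[i]
    return out

rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))  # orthogonal => lossless loop
ir = fdn_impulse_response(np.array([149, 211, 263, 293]),
                          0.97 * Q, np.ones(4), np.ones(4) / 4)
print(ir[:8])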
Semi-automatic Ambience Generation
Ambiances are background recordings used in audiovisual productions to make listeners feel they are in places like a pub or a farm. Accessing commercially available atmosphere libraries is a convenient alternative to sending teams out to record ambiances, yet such libraries limit creation in several ways. First, they are already mixed, which reduces the flexibility to add or remove individual sounds or change their panning. Secondly, the number of ambient libraries is limited. We propose a semi-automatic system for ambiance generation. The system creates ambiances on demand given text queries, by fetching relevant sounds from a large sound-effect database and importing them into a sequencer multitrack project. Ambiances of diverse nature can be created easily. Several controls are provided to the users to refine the type of samples and the sound arrangement.
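A minimal sketch of the arrangement step, assuming a hypothetical fetch_sounds() lookup in place of a real sound-effect database query; clips are scattered across tracks at random onsets with panning left editable, which is what keeping an unmixed multitrack project buys over a pre-mixed atmosphere.

# Hypothetical pipeline: text query -> candidate samples -> multitrack layout.
import random

def fetch_sounds(query, n=8):
    """Stand-in for a database query; returns (filename, duration_s) pairs."""
    return [(f"{query}_{i}.wav", random.uniform(2.0, 15.0)) for i in range(n)]

def arrange_ambiance(query, total_duration=60.0, n_tracks=4, seed=0):
    random.seed(seed)
    tracks = [[] for _ in range(n_tracks)]
    for filename, dur in fetch_sounds(query):
        track = random.randrange(n_tracks)
        onset = random.uniform(0.0, max(0.0, total_duration - dur))
        pan = random.uniform(-1.0, 1.0)       # per-clip panning stays editable
        tracks[track].append({"file": filename, "onset": onset,
                              "duration": dur, "pan": pan})
    return tracks

for i, track in enumerate(arrange_ambiance("farm")):
    print(f"track {i}: {[clip['file'] for clip in track]}")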
Antiderivative Antialiasing for Recurrent Neural Networks
Neural networks have become invaluable for general audio processing tasks, such as virtual analog modeling of nonlinear audio equipment. For sequence modeling tasks in particular, recurrent neural networks (RNNs) have gained widespread adoption in recent years. Their general applicability and effectiveness stem partly from their inherent nonlinearity, which makes them prone to aliasing. Recent work has explored mitigating aliasing by oversampling the network, an approach whose effectiveness is directly linked with the incurred computational cost. This work explores an alternative route by extending the antiderivative antialiasing technique to explicit, computable RNNs. Detailed applications to the Gated Recurrent Unit and the Long Short-Term Memory cell are shown as case studies. The proposed technique is evaluated on multiple pre-trained guitar amplifier models, assessing its impact on the amount of aliasing and on model tonality. The method is shown to reduce the models' tendency to alias considerably across all considered sample rates while affecting their tonality only moderately, without requiring high oversampling factors. The results of this study can be used to improve sound quality in neural audio processing tasks that employ a suitable class of RNNs. Additional materials are provided on the accompanying webpage.
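The underlying mechanism is easiest to see on a memoryless nonlinearity. The sketch below applies first-order antiderivative antialiasing to tanh, the kind of saturating activation found inside RNN cells; extending this to the gated recurrent structure is the paper's contribution, and the epsilon fallback value here is an illustrative choice.

# First-order ADAA for tanh: replace pointwise evaluation with the divided
# difference of the antiderivative F1(x) = log(cosh(x)), falling back to the
# midpoint evaluation when successive inputs are nearly equal.
import numpy as np

def tanh_adaa(x, eps=1e-5):
    F1 = np.log(np.cosh(x))                   # antiderivative of tanh
    y = np.empty_like(x)
    y[0] = np.tanh(x[0])
    for n in range(1, len(x)):
        dx = x[n] - x[n - 1]
        if abs(dx) < eps:                     # ill-conditioned difference
            y[n] = np.tanh(0.5 * (x[n] + x[n - 1]))
        else:
            y[n] = (F1[n] - F1[n - 1]) / dx
    return y

sr, f0 = 44100, 2000.0
t = np.arange(4096) / sr
x = 5.0 * np.sin(2 * np.pi * f0 * t)          # hot input drives tanh hard
# Nonzero where the signal moves fast: ADAA deviates from naive tanh there.
print(np.max(np.abs(tanh_adaa(x) - np.tanh(x))))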
Neural Music Instrument Cloning From Few Samples
Neural music instrument cloning is an application of deep neural networks for imitating the timbre of a particular music instrument recording with a trained neural network. One can create such clones using an approach such as DDSP [1], which has been shown to achieve good synthesis quality for several instrument types [2]. However, this approach needs about ten minutes of audio data from the instrument of interest (target recording audio). In this work, we modify the DDSP architecture and apply transfer learning techniques used in speech voice cloning [3] to significantly reduce the amount of target recording audio required. We compare various cloning approaches and architectures across durations of target recording audio, ranging from 4 to 256 seconds. We demonstrate editing of loudness and pitch, as well as timbre transfer, from only 16 seconds of target recording audio. Our code is available online, as well as many audio examples.
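The transfer-learning step can be sketched as freezing a pre-trained model's shared layers and fine-tuning only a small instrument-specific part on the short target clip. The module names and sizes below are illustrative assumptions, not the modified DDSP components used in the paper.

# Hypothetical clone model: a shared encoder pre-trained on many recordings
# stays frozen; only a small timbre head is fine-tuned on the target clip.
import torch.nn as nn

class CloneModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.shared_encoder = nn.GRU(input_size=2, hidden_size=64,
                                     batch_first=True)  # pitch + loudness in
        self.timbre_head = nn.Linear(64, 65)            # instrument-specific

model = CloneModel()  # in practice: load pre-trained weights here
for p in model.shared_encoder.parameters():
    p.requires_grad = False                   # keep shared knowledge fixed
trainable = [p for p in model.parameters() if p.requires_grad]
print(sum(p.numel() for p in trainable), "parameters fine-tuned per clone")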
Improving Unsupervised Clean-to-Rendered Guitar Tone Transformation Using GANs and Integrated Unaligned Clean Data
Recent years have seen increasing interest in applying deep learning methods to the modeling of guitar amplifiers or effect pedals. Existing methods are mainly based on the supervised approach, requiring temporally-aligned data pairs of unprocessed and rendered audio. However, this approach does not scale well, due to the complicated process involved in creating the data pairs. Recent work by Wright et al. has explored the potential of leveraging unpaired data for training, using a generative adversarial network (GAN)-based framework. This paper extends their work by using more advanced discriminators in the GAN and by using more unpaired data for training. Specifically, drawing inspiration from recent advancements in neural vocoders, we employ in our GAN-based model for guitar amplifier modeling two sets of discriminators, one based on the multi-scale discriminator (MSD) and the other on the multi-period discriminator (MPD). Moreover, we experiment with adding unprocessed audio signals that do not have corresponding rendered audio of a target tone to the training data, to see how much the GAN model benefits from the unpaired data. Our experiments show that the two proposed extensions contribute to the modeling of both low-gain and high-gain guitar amplifiers.
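The MPD idea borrowed from neural vocoders can be sketched as folding the waveform into a (time x period) grid so that 2-D convolutions can inspect periodic structure, with one sub-discriminator per period. The layer widths below are illustrative placeholders.

# One sub-discriminator of a multi-period discriminator: reshape the signal
# so samples one period apart line up along a convolution axis.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PeriodDiscriminator(nn.Module):
    def __init__(self, period):
        super().__init__()
        self.period = period
        self.convs = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=(5, 1), stride=(3, 1), padding=(2, 0)),
            nn.LeakyReLU(0.1),
            nn.Conv2d(16, 1, kernel_size=(3, 1), padding=(1, 0)),
        )

    def forward(self, x):
        # x: (batch, 1, T) waveform; pad so T is a multiple of the period.
        b, c, t = x.shape
        pad = (-t) % self.period
        x = F.pad(x, (0, pad), mode="reflect")
        x = x.view(b, c, -1, self.period)     # (batch, 1, T/period, period)
        return self.convs(x)

disc = PeriodDiscriminator(period=3)
scores = disc(torch.randn(2, 1, 16000))
print(scores.shape)  # per-patch real/fake scores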