Two Datasets of Room Impulse Responses for Navigation in Six Degrees-of-Freedom: A Symphonic Concert Hall and a Former Planetarium
This paper presents two datasets of room impulse responses (RIRs) for navigable virtual acoustics. The first is a set of 240 mono and Ambisonic RIRs recorded at the Maison Symphonique, a symphonic concert hall in Montreal renowned for its acoustics. The second is a set of 67 third-order Ambisonic RIRs recorded in the former planetarium of Montreal (currently known as the Centech), a space whose acoustics include a focal point where extreme reverberation times occur. The article first describes the two datasets and the methods used to capture them. A use case for these RIRs is then presented: an audio rendering of scene navigation using interpolation among RIRs.
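As a rough illustration of what interpolation among RIRs can mean, the sketch below blends measured responses with inverse-distance weights at a listener position. It is a minimal example under stated assumptions, not the method used in the paper: the function name `interpolate_rir`, the `power` parameter, and the data layout are all hypothetical, and all RIRs are assumed to be mono and of equal length.

```python
import math


def interpolate_rir(listener_pos, measurements, power=2.0):
    """Blend measured RIRs with inverse-distance weights.

    measurements: list of (position, rir) pairs, where position is an
    (x, y, z) tuple and rir is a list of samples of equal length.
    """
    weights = []
    for pos, rir in measurements:
        d = math.dist(listener_pos, pos)
        if d == 0.0:
            # Listener sits exactly on a measurement point: use it directly.
            return list(rir)
        weights.append(1.0 / d ** power)

    total = sum(weights)
    length = len(measurements[0][1])
    out = [0.0] * length
    for w, (_, rir) in zip(weights, measurements):
        for n in range(length):
            out[n] += (w / total) * rir[n]
    return out
```

Sample-wise blending of this kind ignores differences in arrival time between measurement points, so it is only a starting point for denser grids.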
Fast Temporal Convolutions for Real-Time Audio Signal Processing
This paper explores how the convolutional layers of neural networks can be optimized for modeling nonlinear audio systems and effects. Enhanced methods for real-time dilated convolutions are presented that achieve faster signal processing times than previous work. With the improved implementation of convolutional layers, a significant decrease in computational requirements was observed and validated on different configurations of single layers with dilated convolutions and on WaveNet-style feedforward neural network models. In most cases, the signal processing times were equivalent to those of recurrent neural networks with Long Short-Term Memory units and Gated Recurrent Units, which are considered state-of-the-art in the field of black-box virtual analog modeling.
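For readers unfamiliar with the operation, a causal dilated convolution computes y[n] as the sum over k of w[k] * x[n - k * d], where d is the dilation. The sketch below shows a plain sample-by-sample version that keeps past inputs in a ring buffer. It only illustrates the operation itself; the class name `StreamingDilatedConv1d` is hypothetical, and this is not the optimized implementation the paper describes.

```python
class StreamingDilatedConv1d:
    """Single-channel causal dilated convolution, one sample at a time."""

    def __init__(self, kernel, dilation):
        self.kernel = list(kernel)   # kernel[k] multiplies x[n - k * dilation]
        self.dilation = dilation
        size = (len(self.kernel) - 1) * dilation + 1
        self.buffer = [0.0] * size   # ring buffer holding past input samples
        self.write = 0               # index where the next sample is stored

    def process(self, x):
        self.buffer[self.write] = x
        y = 0.0
        for k, w in enumerate(self.kernel):
            # Step back k * dilation samples, wrapping around the buffer.
            idx = (self.write - k * self.dilation) % len(self.buffer)
            y += w * self.buffer[idx]
        self.write = (self.write + 1) % len(self.buffer)
        return y
```

With `kernel=[1.0, 0.5]` and `dilation=2`, feeding a unit impulse yields the impulse followed two samples later by a copy scaled by 0.5.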
Realistic Gramophone Noise Synthesis Using a Diffusion Model
This paper introduces a novel data-driven strategy for synthesizing gramophone noise audio textures. A diffusion probabilistic model is applied to generate highly realistic quasiperiodic noises. The proposed model is designed to generate samples of length equal to one disk revolution, and a method to generate plausible periodic variations between revolutions is also proposed. In addition, a guided approach is applied as a conditioning method, where an audio signal generated with manually-tuned signal processing is refined via reverse diffusion to improve realism. The method has been evaluated in a subjective listening test, in which the participants were often unable to distinguish the synthesized signals from the real ones. The synthetic noises produced with the best proposed unconditional method are statistically indistinguishable from real noise recordings. This work shows the potential of diffusion models for highly realistic audio synthesis tasks.
A Study of Control Methods for Percussive Sound Synthesis Based on GANs
The process of creating drum sounds has seen significant evolution in the past decades. The development of analogue drum synthesizers, such as the TR-808, and modern sound design tools in Digital Audio Workstations led to a variety of drum timbres that defined entire musical genres. Recently, drum synthesis research has been revived with a new focus on training generative neural networks to create drum sounds. Different interfaces have previously been proposed to control the generative process, from low-level latent space navigation to high-level semantic feature parameterisation, but no comprehensive analysis has been presented to evaluate how each approach relates to the creative process. We aim to evaluate how different interfaces support creative control over drum generation by conducting a user study based on the Creative Support Index. We experiment with both a supervised method that decodes semantic latent space directions and an unsupervised Closed-Form Factorization approach from computer vision literature to parameterise the generation process and demonstrate that the latter is the preferred means to control a drum synthesizer based on the StyleGAN2 network architecture.
A Direct Microdynamics Adjusting Processor with Matching Paradigm and Differentiable Implementation
In this paper, we propose a new processor capable of directly changing the microdynamics of an audio signal primarily via a single dedicated user-facing parameter. The novelty of our processor is that it has built into it a measure of relative level, a short-term signal strength measurement which is robust to changes in signal macrodynamics. Consequent dynamic range processing is signal level-independent in its nature, and attempts to directly alter its observed relative level measurements. The inclusion of such a meter within our proposed processor also gives rise to a natural solution to the dynamics matching problem, where we attempt to transfer the microdynamic characteristics of one audio recording to another by means of estimating appropriate settings for the processor. We suggest a means of providing a reasonable initial guess for processor settings, followed by an efficient iterative algorithm to refine upon our estimates. Additionally, we implement the processor as a differentiable recurrent layer and show its effectiveness when wrapped around a gradient descent optimizer within a deep learning framework. Moreover, we illustrate that the proposed processor has more favorable gradient characteristics relative to a conventional dynamic range compressor. Throughout, we consider extensions of the processor, matching algorithm, and differentiable implementation for the multiband case.
Virtual Analog Modeling of Distortion Circuits Using Neural Ordinary Differential Equations
Recent research in deep learning has shown that neural networks can learn differential equations governing dynamical systems. In this paper, we adapt this concept to Virtual Analog (VA) modeling to learn the ordinary differential equations (ODEs) governing the first-order and the second-order diode clipper. The proposed models achieve performance comparable to state-of-the-art recurrent neural networks (RNNs) while using fewer parameters. We show that this approach does not require oversampling and allows the sampling rate to be increased after training has completed, which results in increased accuracy. Using a more sophisticated numerical solver increases the accuracy further at the cost of slower processing. ODEs learned this way do not require closed forms but are still physically interpretable.
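To make the role of the solver and the sampling rate concrete, the sketch below integrates the commonly used analytic ODE of the first-order diode clipper, dv/dt = (u - v)/(RC) - (2 I_s / C) sinh(v / V_t), with backward Euler and Newton's method. This is only an illustration of solving such an ODE at an arbitrary step size. In the paper the right-hand side is learned by a neural network instead of being written in closed form, and the component values, function names, and solver choice here are assumptions.

```python
import math

# Assumed component values (illustrative, not taken from the paper).
R = 2.2e3        # series resistance, ohms
C = 10e-9        # capacitance, farads
I_S = 2.52e-9    # diode saturation current, amperes
V_T = 25.85e-3   # thermal voltage, volts


def clipper_rhs(v, u):
    """dv/dt for the first-order diode clipper with input u and output v."""
    return (u - v) / (R * C) - (2.0 * I_S / C) * math.sinh(v / V_T)


def simulate(inputs, sample_rate, newton_iters=50, tol=1e-9):
    """Integrate with backward Euler, solving each step by Newton's method."""
    dt = 1.0 / sample_rate
    v = 0.0
    out = []
    for u in inputs:
        v_prev = v
        for _ in range(newton_iters):
            # Residual of the implicit update v = v_prev + dt * f(v, u).
            g = v - v_prev - dt * clipper_rhs(v, u)
            dg = (1.0 + dt / (R * C)
                  + dt * (2.0 * I_S / (C * V_T)) * math.cosh(v / V_T))
            step = g / dg
            v -= step
            if abs(step) < tol:
                break
        out.append(v)
    return out
```

Because the step size `dt` is derived from `sample_rate` at call time, the same right-hand side can be run at 44.1 kHz or 88.2 kHz without any other change, which is the property the abstract refers to.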
On the Challenges of Embedded Real-Time Music Information Retrieval
Real-time applications of Music Information Retrieval (MIR) have recently been gaining interest. However, as deep learning becomes more and more ubiquitous for music analysis tasks, several challenges and limitations need to be overcome to deliver accurate and quick real-time MIR systems. In addition, modern embedded computers offer great potential for compact systems that use MIR algorithms, such as digital musical instruments, but embedded computing hardware is generally resource constrained, posing additional limitations. In this paper, we identify and discuss the challenges and limitations of embedded real-time MIR. Furthermore, we discuss potential solutions to these challenges, and demonstrate their validity by presenting an embedded real-time classifier of expressive acoustic guitar techniques. The classifier achieved 99.2% accuracy in distinguishing pitched and percussive techniques and a 99.1% average accuracy in distinguishing four distinct percussive techniques with a fifth class for pitched sounds. The full classification task is a considerably more complex learning problem, with our preliminary results reaching only 56.5% accuracy. The results were produced with an average latency of 30.7 ms.
Modeling and Extending the RCA Mark II Sound Effects Filter
We have analyzed the Sound Effects Filter from the one-of-a-kind RCA Mark II sound synthesizer and modeled it as a Wave Digital Filter using the Faust language, to make this once exclusive device widely available. By studying the original schematics and measurements of the device, we discovered several circuit modifications. Building on these, we proposed a number of extensions to the circuit which increase its usefulness in music production.
Feature-Informed Latent Space Regularization for Music Source Separation
The integration of additional side information to improve music source separation has been investigated numerous times, e.g., by adding features to the input or by adding learning targets in a multi-task learning scenario. These approaches, however, require additional annotations such as musical scores or instrument labels in training and possibly during inference. The available datasets for source separation do not usually provide these additional annotations. In this work, we explore transfer learning strategies to incorporate VGGish features with a state-of-the-art source separation model; VGGish features are known to be a very condensed representation of audio content and have been successfully used in many music information retrieval tasks. We introduce three approaches to incorporate the features, including two latent space regularization methods and one naive concatenation method. Our preliminary results show that our proposed approaches could improve some evaluation metrics for music source separation. We also discuss the pros and cons of each approach and potential extensions and improvements.
A Comparison of Deep Learning Inference Engines for Embedded Real-Time Audio Classification
Recent advancements in deep learning have shown great potential for audio applications, improving the accuracy of previous solutions for tasks such as music transcription, beat detection, and real-time audio processing. In addition, the availability of increasingly powerful embedded computers has led many deep learning framework developers to devise software optimized to run pretrained models in resource-constrained contexts. As a result, the use of deep learning on embedded devices and audio plugins has become more widespread. However, confusion has been rising around deep learning inference engines, regarding which of these can run in real-time and which are less resource-hungry. In this paper, we present a comparison of four available deep learning inference engines for real-time audio classification on the CPU of an embedded single-board computer: TensorFlow Lite, TorchScript, ONNX Runtime, and RTNeural. Results show that all inference engines can execute neural network models in real-time with appropriate code practices, but execution time varies between engines and models. Most importantly, we found that most of the less-specialized engines offer great flexibility and can be used effectively for real-time audio classification, with slightly better results than a real-time-specific approach. In contrast, more specialized solutions can offer a lightweight and minimalist alternative where less flexibility is needed.