Neural Audio Processing on Android Phones
This study investigates the potential of real-time inference of neural audio effects on Android smartphones, an initial step towards bridging the gap in neural audio processing for mobile devices. Focusing exclusively on processing rather than synthesis, we explore the performance of three open-source neural models across five Android phones released between 2014 and 2022, whose capabilities vary with their generational differences. Through a comparative analysis using two C++ inference engines (ONNX Runtime and RTNeural), we evaluate the computational efficiency and timing performance of these models, considering the varying computational loads and the hardware specifics of each device. Our work contributes insights into the feasibility of real-time neural audio processing on mobile platforms, highlighting challenges and opportunities for future advancements in this rapidly evolving field.
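As a rough illustration of the kind of timing measurement involved, the sketch below times repeated inferences of an exported model against the real-time deadline of one audio block. It uses ONNX Runtime's Python API rather than the C++ API benchmarked in the paper, and the model file "model.onnx", its input name "input", the tensor shape, and the block size are placeholders, not the paper's setup.

    import time
    import numpy as np
    import onnxruntime as ort

    SAMPLE_RATE = 48000
    BLOCK_SIZE = 256                                  # samples per audio callback
    BUDGET_MS = 1000.0 * BLOCK_SIZE / SAMPLE_RATE     # real-time deadline per block

    sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
    block = np.zeros((1, BLOCK_SIZE, 1), dtype=np.float32)  # (batch, time, channels)

    sess.run(None, {"input": block})                  # warm-up run
    n_runs = 100
    t0 = time.perf_counter()
    for _ in range(n_runs):
        sess.run(None, {"input": block})
    ms_per_block = 1000.0 * (time.perf_counter() - t0) / n_runs
    print(f"{ms_per_block:.3f} ms per block, budget {BUDGET_MS:.3f} ms")

A model is only viable for real-time use on a given phone if its per-block time stays safely below the block deadline.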
Synthesizer Sound Matching Using Audio Spectrogram Transformers
Systems for synthesizer sound matching, which automatically set the parameters of a synthesizer to emulate an input sound, have the potential to make the process of synthesizer programming faster and easier for novice and experienced musicians alike, whilst also affording new means of interaction with synthesizers. Considering the enormous variety of synthesizers in the marketplace, and the complexity of many of them, general-purpose sound matching systems that function with minimal knowledge or prior assumptions about the underlying synthesis architecture are particularly desirable. With this in mind, we introduce a synthesizer sound matching model based on the Audio Spectrogram Transformer. We demonstrate the viability of this model by training on a large synthetic dataset of randomly generated samples from the popular Massive synthesizer. We show that this model can reconstruct parameters of samples generated from a set of 16 parameters, highlighting its improved fidelity relative to multi-layer perceptron and convolutional neural network baselines. We also provide audio examples demonstrating the out-of-domain model performance in emulating vocal imitations, and sounds from other synthesizers and musical instruments.
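A minimal sketch of the sound-matching setup, assuming PyTorch and torchaudio: a spectrogram is encoded by a transformer and regressed onto 16 normalized parameters with an MSE loss. Note that the Audio Spectrogram Transformer embeds 2D spectrogram patches, whereas this simplified stand-in tokenizes whole frames; all layer sizes here are illustrative, not the paper's configuration.

    import torch
    import torch.nn as nn
    import torchaudio

    N_PARAMS = 16  # the 16 synthesizer parameters matched in the paper

    class SpecTransformerMatcher(nn.Module):
        def __init__(self, n_mels=128, d_model=192, n_heads=4, n_layers=4):
            super().__init__()
            self.mel = torchaudio.transforms.MelSpectrogram(
                sample_rate=44100, n_fft=1024, hop_length=256, n_mels=n_mels)
            self.proj = nn.Linear(n_mels, d_model)   # one token per spectrogram frame
            layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, n_layers)
            self.head = nn.Linear(d_model, N_PARAMS)

        def forward(self, audio):                         # audio: (batch, samples)
            spec = self.mel(audio).clamp(min=1e-5).log()  # (batch, n_mels, frames)
            tokens = self.proj(spec.transpose(1, 2))      # (batch, frames, d_model)
            pooled = self.encoder(tokens).mean(dim=1)     # average over time
            return torch.sigmoid(self.head(pooled))       # parameters in [0, 1]

    model = SpecTransformerMatcher()
    audio = torch.randn(2, 44100)        # stand-in for rendered synth samples
    target = torch.rand(2, N_PARAMS)     # stand-in for ground-truth parameters
    loss = nn.functional.mse_loss(model(audio), target)
    loss.backward()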
Balancing Error and Latency of Black-Box Models for Audio Effects Using Hardware-Aware Neural Architecture Search
In this paper, we address automating and systematizing the search for black-box models of virtual analogue audio effects with an optimal balance between error and latency. We introduce a multi-objective optimization approach based on hardware-aware neural architecture search, which allows the balance between model error and latency to be specified according to the requirements of the application. By using a regularized evolutionary algorithm, the search navigates a huge space of candidates systematically. Additionally, we propose a search space for modelling non-linear dynamic audio effects consisting of over 41 trillion different WaveNet-style architectures. The search yields highly effective architectures that are either up to 18× faster than, or achieve up to 56% lower test loss than, the best-performing models from related work, while still showing a favourable trade-off. We conclude that hardware-aware neural architecture search is a valuable tool that can help researchers and engineers develop virtual analogue models by automating architecture design, saving the time otherwise spent on manual trial-and-error search and evaluation.
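The regularized ("aging") evolutionary loop at the heart of such a search can be sketched in a few lines. Everything below is a stand-in: sample_architecture, mutate, and evaluate are placeholders for the paper's WaveNet-style search space, short training runs, and on-device latency profiling, and w_latency plays the role of the application-specified error/latency balance.

    import random
    import collections

    def sample_architecture():
        return {"layers": random.randint(2, 18),
                "channels": random.choice([4, 8, 16, 32])}

    def mutate(arch):
        child = dict(arch)
        if random.random() < 0.5:
            child["layers"] = max(2, child["layers"] + random.choice([-1, 1]))
        else:
            child["channels"] = random.choice([4, 8, 16, 32])
        return child

    def evaluate(arch):
        # Placeholder objectives: real use trains briefly for error and
        # profiles latency on the target hardware.
        error = 1.0 / (arch["layers"] * arch["channels"])
        latency = arch["layers"] * arch["channels"] * 1e-3
        return error, latency

    def fitness(arch, w_latency=0.5):
        error, latency = evaluate(arch)
        return -(error + w_latency * latency)   # higher is better

    population = collections.deque(
        [sample_architecture() for _ in range(50)], maxlen=50)
    for _ in range(500):
        tournament = random.sample(list(population), 10)
        parent = max(tournament, key=fitness)
        population.append(mutate(parent))       # the deque drops the oldest (aging)
    best = max(population, key=fitness)
    print(best, evaluate(best))

The aging mechanism, where the oldest individual is discarded regardless of fitness, is what regularizes the search and keeps it from converging prematurely on one region of the space.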
Audio-Visual Talker Localization in Video for Spatial Sound Reproduction
Object-based audio production requires positional metadata to be defined for each point-source object, including the key elements in the foreground of the sound scene. In many media production use cases, both cameras and microphones are employed to make recordings, and the human voice is often a key element. In this research, we detect and locate the active speaker in the video, facilitating the automatic extraction of the talker's positional metadata relative to the camera's reference frame. With the integration of the visual modality, this study expands upon our previous investigation, which focused solely on audio-based active speaker detection and localization. Our experiments compare conventional audio-visual approaches for active speaker detection that leverage monaural audio, our previous audio-only method that leverages multichannel recordings from a microphone array, and a novel audio-visual approach integrating vision and multichannel audio. We found that the two modalities complement each other. Multichannel audio, overcoming the problem of visual occlusions, provides a double-digit reduction in detection error compared to audio-visual methods with single-channel audio. The combination of multichannel audio and vision further enhances spatial accuracy, leading to a four-percentage-point increase in F1 score on the Tragic Talkers dataset. Future investigations will assess the robustness of the model in noisy and highly reverberant environments, as well as tackle the problem of off-screen speakers.
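For intuition about what multichannel audio contributes, the classical GCC-PHAT estimator below recovers the time difference of arrival between two microphones, the spatial cue that survives visual occlusion. This is a textbook baseline for illustration only, not the learned model evaluated in the paper.

    import numpy as np

    def gcc_phat(sig, ref, fs, max_tau=None, interp=16):
        n = sig.shape[0] + ref.shape[0]
        R = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
        R /= np.abs(R) + 1e-12                    # PHAT weighting: keep phase only
        cc = np.fft.irfft(R, n=interp * n)
        max_shift = interp * n // 2
        if max_tau is not None:
            max_shift = min(int(interp * fs * max_tau), max_shift)
        cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
        shift = np.argmax(np.abs(cc)) - max_shift
        return shift / float(interp * fs)         # delay in seconds

    fs = 16000
    src = np.random.randn(fs)                     # one second of noise as a "voice"
    mic1 = src
    mic2 = np.roll(src, 40)                       # simulate a 2.5 ms inter-mic delay
    tau = gcc_phat(mic2, mic1, fs, max_tau=0.005)
    print(f"estimated delay: {tau * 1000:.2f} ms")  # ~2.5 ms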
Distortion Recovery: A Two-Stage Method for Guitar Effect Removal
Removing audio effects from electric guitar recordings makes post-production and sound editing easier. An audio distortion recovery model not only improves the clarity of the guitar sounds but also opens up new opportunities for creative adjustments in mixing and mastering. While progress has been made in creating such models, previous efforts have largely focused on synthetic distortions that may be too simplistic to accurately capture the complexities seen in real-world recordings. In this paper, we tackle the task by using a dataset of guitar recordings rendered with commercial-grade audio effect VST plugins. Moreover, we introduce a novel two-stage methodology for audio distortion recovery: the first stage processes the audio signal in the Mel-spectrogram domain, and the second stage uses a neural vocoder to generate the pristine original guitar sound from the processed Mel-spectrogram. We report a set of experiments demonstrating the effectiveness of our approach over existing methods, through both subjective and objective evaluation metrics.
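A minimal sketch of the two-stage pipeline, assuming PyTorch and torchaudio: MelDenoiser is a hypothetical stand-in for the paper's first-stage restoration network, and Griffin-Lim (via an inverse mel projection) stands in for the paper's neural vocoder in the second stage.

    import torch
    import torch.nn as nn
    import torchaudio

    SR, N_FFT, HOP, N_MELS = 44100, 1024, 256, 128

    class MelDenoiser(nn.Module):       # placeholder first-stage network
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                nn.Conv2d(16, 1, 3, padding=1))
        def forward(self, mel):         # mel: (batch, n_mels, frames)
            return self.net(mel.unsqueeze(1)).squeeze(1)

    to_mel = torchaudio.transforms.MelSpectrogram(
        SR, n_fft=N_FFT, hop_length=HOP, n_mels=N_MELS)
    inv_mel = torchaudio.transforms.InverseMelScale(
        n_stft=N_FFT // 2 + 1, n_mels=N_MELS, sample_rate=SR)
    griffin = torchaudio.transforms.GriffinLim(n_fft=N_FFT, hop_length=HOP)

    distorted = torch.randn(1, SR)               # stand-in for a distorted guitar clip
    mel = to_mel(distorted)                      # input to stage one
    clean_mel = MelDenoiser()(mel)               # stage one: Mel-domain restoration
    waveform = griffin(inv_mel(clean_mel.clamp(min=0)))  # stage two: resynthesis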
Sample Rate Independent Recurrent Neural Networks for Audio Effects Processing
In recent years, machine learning approaches to modelling guitar amplifiers and effects pedals have been widely investigated and have become standard practice in some consumer products. In particular, recurrent neural networks (RNNs) are a popular choice for modelling non-linear devices such as vacuum tube amplifiers and distortion circuitry. One limitation of such models is that they are trained on audio at a specific sample rate and therefore give unreliable results when operating at another rate. Here, we investigate several methods of modifying RNN structures to make them approximately sample rate independent, with a focus on oversampling. In the case of integer oversampling, we demonstrate that a previously proposed delay-based approach provides high fidelity sample rate conversion whilst additionally reducing aliasing. For non-integer sample rate adjustment, we propose two novel methods and show that one of these, based on cubic Lagrange interpolation of a delay-line, provides a significant improvement over existing methods. To our knowledge, this work provides the first in-depth study into this problem.
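The core of the proposed non-integer adjustment, cubic Lagrange interpolation of a delay line, can be sketched on its own; the surrounding RNN is omitted here, and the weights follow the standard third-order Lagrange fractional-delay formula.

    import numpy as np

    def lagrange_weights(frac):
        # Third-order Lagrange weights for a 4-tap window, interpolating at
        # fractional position d = 1 + frac (0 <= frac < 1), where the FIR
        # approximation is most accurate.
        d = 1.0 + frac
        w = np.ones(4)
        for k in range(4):
            for j in range(4):
                if j != k:
                    w[k] *= (d - j) / (k - j)
        return w

    def fractional_delay(x, delay):
        # Read x[n - delay] for every n with 4-tap cubic Lagrange interpolation.
        n_int = int(np.floor(delay))
        w = lagrange_weights(delay - n_int)
        y = np.zeros_like(x)
        for n in range(len(x)):
            window = [x[n - n_int + 1 - k] if 0 <= n - n_int + 1 - k < len(x) else 0.0
                      for k in range(4)]
            y[n] = np.dot(w, window)
        return y

    x = np.sin(2 * np.pi * 441 * np.arange(1024) / 44100)
    y = fractional_delay(x, 2.37)      # delay by 2.37 samples

Inside a recurrent layer, a read like this lets the effective feedback delay scale smoothly with the ratio between training and inference sample rates.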
Network Bending of Diffusion Models for Audio-Visual Generation
In this paper we present the first steps towards the creation of a tool which enables artists to create music visualizations using pretrained, generative machine learning models. First, we investigate the application of network bending, the process of applying transforms within the layers of a generative network, to image generation diffusion models by utilizing a range of point-wise, tensor-wise, and morphological operators. We identify a number of visual effects that result from various operators, including some that are not easily recreated with standard image editing tools. We find that this process allows for continuous, fine-grained control of image generation, which can be helpful for creative applications. Next, we generate music-reactive videos using Stable Diffusion by passing audio features as parameters to network bending operators. Finally, we comment on certain transforms which radically shift the image, and on the possibility of learning more about the latent space of Stable Diffusion through these transforms.
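In PyTorch terms, network bending amounts to registering a forward hook that transforms a layer's activations in flight. The sketch below applies a point-wise scaling driven by a stand-in audio feature to a tiny dummy network so that it runs self-contained; in the paper the same mechanism targets layers inside Stable Diffusion.

    import torch
    import torch.nn as nn

    model = nn.Sequential(
        nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
        nn.Conv2d(8, 3, 3, padding=1))

    bend_amount = {"scale": 1.0}       # updated per video frame from audio

    def bend_hook(module, inputs, output):
        return output * bend_amount["scale"]   # point-wise bending operator

    handle = model[0].register_forward_hook(bend_hook)

    x = torch.randn(1, 3, 64, 64)
    for rms in [0.1, 0.5, 1.0]:        # stand-in for an audio RMS envelope
        bend_amount["scale"] = 1.0 + 2.0 * rms
        frame = model(x)               # bent generation for this frame
    handle.remove()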
Wave Digital Model of the MXR Phase 90 Based on a Time-Varying Resistor Approximation of JFET Elements
Virtual Analog (VA) modeling is the practice of digitally emulating analog audio gear. Over the past few years, with the purpose of recreating the distinctive sound attributed to classic audio equipment and the musicians associated with it, many different guitar pedals have been emulated by means of the VA paradigm, but little attention has been given to phasers. Phasers process the spectrum of the input signal with time-varying notches by means of shifting stages typically realized with a network of transistors, whose nonlinear equations are, in general, demanding to solve. In this paper, we take as a reference the famous MXR Phase 90 guitar pedal, and we propose an efficient time-varying model of its Junction Field-Effect Transistors (JFETs) based on a channel resistance approximation. We then employ this model in the Wave Digital domain to emulate the guitar pedal in real-time, obtaining an implementation characterized by low computational cost and good accuracy.
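For orientation, a conventional digital phaser can be sketched as a cascade of first-order allpass stages whose break frequency is swept by an LFO, which is the role the LFO-modulated JFET channel resistance plays in the Phase 90. The sketch below is this plain allpass cascade, not the paper's Wave Digital formulation, and all rates and sweep depths are illustrative.

    import numpy as np

    fs = 44100
    n = fs                                       # one second of audio
    x = 0.1 * np.random.randn(n)                 # stand-in guitar signal
    lfo = 0.5 * (1 + np.sin(2 * np.pi * 0.5 * np.arange(n) / fs))  # 0.5 Hz sweep
    fb = 200.0 + 1800.0 * lfo                    # swept allpass break frequency, Hz

    n_stages = 4                                 # the Phase 90 uses four stages
    state = np.zeros(n_stages)                   # one-sample state per stage
    y = np.zeros(n)
    for i in range(n):
        tn = np.tan(np.pi * fb[i] / fs)
        a = (tn - 1.0) / (tn + 1.0)              # first-order allpass coefficient
        s = x[i]
        for k in range(n_stages):
            out = a * s + state[k]               # H(z) = (a + z^-1) / (1 + a z^-1)
            state[k] = s - a * out
            s = out
        y[i] = 0.5 * (x[i] + s)                  # dry/wet mix creates the notches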
Differentiable MIMO Feedback Delay Networks for Multichannel Room Impulse Response Modeling
Recently, with the advent of new high-performance headsets and goggles, the demand for Virtual and Augmented Reality applications has experienced a steep increase. In order to coherently navigate virtual rooms, the acoustics of the scene must be emulated as accurately and efficiently as possible. Amongst others, Feedback Delay Networks (FDNs) have proved to be valuable tools for tackling such a task. In this article, we expand and adapt a method recently proposed for the data-driven optimization of single-input-single-output FDNs to the multiple-input-multiple-output (MIMO) case, addressing spatial/space-time processing applications. By testing our methodology on items taken from two different datasets, we show that the parameters of MIMO FDNs can be jointly optimized to match some perceptual characteristics of given multichannel room impulse responses, outperforming approaches available in the literature and paving the way toward increasingly efficient and accurate real-time virtual room acoustics rendering.
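A compact sketch of a differentiable MIMO FDN, assuming PyTorch: the feedback matrix is kept orthogonal (hence energy-preserving) by exponentiating a skew-symmetric parameter, so the input, output, and feedback parameters can all be optimized by gradient descent against a target response. The 2-in/2-out configuration, delay lengths, and scalar per-line decay gains are illustrative, not the paper's parametrization.

    import torch
    import torch.nn as nn

    class MimoFDN(nn.Module):
        def __init__(self, n_in=2, n_out=2, delays=(887, 911, 941, 967)):
            super().__init__()
            self.delays = delays
            N = len(delays)
            self.W = nn.Parameter(0.1 * torch.randn(N, N))      # seeds skew-symmetry
            self.B = nn.Parameter(0.1 * torch.randn(N, n_in))   # input gains
            self.C = nn.Parameter(0.1 * torch.randn(n_out, N))  # output gains
            self.g = nn.Parameter(torch.full((N,), 0.995))      # per-line decay

        def forward(self, x):                    # x: (n_in, time)
            T = x.shape[1]
            A = torch.matrix_exp(self.W - self.W.T)   # orthogonal feedback matrix
            bufs = [torch.zeros(d) for d in self.delays]
            y = []
            for t in range(T):
                # Read each delay line, then overwrite the same slot (circular buffer).
                s = torch.stack([b[t % d] for b, d in zip(bufs, self.delays)])
                y.append(self.C @ s)
                v = self.g * (A @ s) + self.B @ x[:, t]
                for k, d in enumerate(self.delays):
                    bufs[k] = bufs[k].clone()    # keep the autograd history intact
                    bufs[k][t % d] = v[k]
            return torch.stack(y, dim=1)         # (n_out, time)

    fdn = MimoFDN()
    out = fdn(torch.eye(2, 2048))                # crude two-channel impulse input
    out.pow(2).sum().backward()                  # gradients flow to all parameters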
A Highly Parametrized Scattering Delay Network Implementation for Interactive Room Auralization
Scattering Delay Networks (SDNs) are an interesting approach to artificial reverberation, with parameters tied to the room's physical properties and the computational efficiency of delay networks. This paper presents a highly parametrized, real-time SDN plugin. The plugin allows for interactive room auralization, enabling users to modify the parameters affecting the reverberation in real-time. These parameters include source and receiver positions, room shape and size, and wall absorption properties. This makes our plugin suitable for applications that require real-time and interactive spatial audio rendering, such as virtual or augmented reality frameworks and video games. Additionally, the main contributions of this work include a filter design method for wall sound absorption, as well as plugin features such as air absorption modeling, various output formats (mono, stereo, binaural, and first- to fifth-order Ambisonics), Open Sound Control (OSC) for controlling source and receiver parameters, and a graphical user interface (GUI). Evaluation tests showed that the reverberation time and the filter design approach are consistent with both theoretical references and real-world measurements. Finally, performance analysis indicated that the SDN plugin requires minimal computational resources.
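A sanity check of the kind used to validate reverberation time against theory can be done with the Sabine equation, RT60 = 0.161 V / Σ S_i α_i; the shoebox dimensions and absorption coefficients below are illustrative.

    Lx, Ly, Lz = 6.0, 4.0, 3.0                     # room dimensions in metres
    alpha = {"floor": 0.10, "ceiling": 0.20,       # absorption per surface
             "walls": 0.05}

    V = Lx * Ly * Lz                               # room volume
    A = (Lx * Ly * alpha["floor"] + Lx * Ly * alpha["ceiling"]
         + 2 * (Lx * Lz + Ly * Lz) * alpha["walls"])   # total absorption area
    rt60 = 0.161 * V / A
    print(f"Sabine RT60: {rt60:.2f} s")            # ~1.14 s for these values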