HRTF Spatial Upsampling in the Spherical Harmonics Domain Employing a Generative Adversarial Network
A Head-Related Transfer Function (HRTF) captures the alterations a sound wave undergoes on its way from the source to the entrances of a listener’s left and right ear canals, and is imperative for creating immersive experiences in virtual and augmented reality (VR/AR). Nevertheless, creating personalized HRTFs demands sophisticated equipment and is hindered by time-consuming data acquisition processes. To counteract these challenges, various techniques for HRTF interpolation and upsampling have been proposed. This paper illustrates how Generative Adversarial Networks (GANs) can be applied to HRTF data upsampling in the spherical harmonics domain. We propose using Autoencoding Generative Adversarial Networks (AE-GAN) to upsample low-degree spherical harmonics coefficients and get a more accurate representation of the full HRTF set. The proposed method is benchmarked against two baselines: barycentric interpolation and HRTF selection. Results from log-spectral distortion (LSD) evaluation suggest that the proposed AE-GAN has significant potential for upsampling very sparse HRTFs, achieving a 17% improvement over baseline methods.
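The log-spectral distortion metric named in the abstract above can be illustrated with a minimal sketch. It assumes the commonly used definition, the root-mean-square of the dB difference between a reference and an estimated magnitude spectrum; the abstract does not state how the authors average over frequency bins, directions, or ears, so the function name `log_spectral_distortion` and the `eps` guard are illustrative assumptions rather than the paper's implementation.

```python
import math


def log_spectral_distortion(ref_mag, est_mag, eps=1e-12):
    """Return the RMS dB difference between two magnitude spectra.

    ref_mag, est_mag: equal-length sequences of non-negative magnitudes.
    eps: small constant (an assumption here) that avoids log10(0).
    """
    if len(ref_mag) != len(est_mag) or len(ref_mag) == 0:
        raise ValueError("spectra must be non-empty and of equal length")
    squared_sum = 0.0
    for ref, est in zip(ref_mag, est_mag):
        # Per-bin level difference in decibels.
        diff_db = 20.0 * math.log10((ref + eps) / (est + eps))
        squared_sum += diff_db * diff_db
    # Root of the mean squared dB difference over all bins.
    return math.sqrt(squared_sum / len(ref_mag))


# Hypothetical spectra: the estimate is 10x too large in every bin,
# which corresponds to a 20 dB level error.
reference = [1.0, 1.0, 1.0, 1.0]
estimate = [10.0, 10.0, 10.0, 10.0]
lsd_db = log_spectral_distortion(reference, estimate)
```

Lower values indicate a closer spectral match, and identical spectra give a distortion of zero.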
Sound Matching Using Synthesizer Ensembles
Sound matching allows users to automatically approximate existing sounds using a synthesizer. Previous work has mostly focused on algorithms for automatically programming an existing synthesizer. This paper proposes a system for selecting between different synthesizer designs, each one with a corresponding automatic programmer. An implementation that allows designing ensembles based on a template is demonstrated. Several experiments are presented using a simple subtractive synthesis design. Using an ensemble of synthesizer-programmer pairs is shown to provide better matching than a single programmer trained for an equivalent integrated synthesizer. Scaling to hundreds of synthesizers is shown to improve match quality.
Audio Visualization via Delay Embedding and Subspace Learning
We describe a sequence of methods for producing videos from audio signals. Our visualizations capture perceptual features like harmonicity and brightness: they produce stable images from periodic sounds and slowly-evolving images from inharmonic ones; they associate jagged shapes to brighter sounds and rounded shapes to darker ones. We interpret our methods as adaptive FIR filterbanks and show how, for larger values of the complexity parameters, we can perform accurate frequency detection without the Fourier transform. Attached to the paper is a code repository containing the Jupyter notebook used to generate the images and videos cited. We also provide code for a realtime C++ implementation of the simplest visualization method. We discuss the mathematical theory of our methods in the two appendices.
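The claim that periodic sounds produce stable images can be illustrated with a plain delay embedding, sketched below under stated assumptions. Each output point collects `dim` samples spaced `tau` samples apart; the function name `delay_embed` and both parameters are hypothetical, and the subspace learning step from the paper is not reproduced here.

```python
import math


def delay_embed(signal, dim=2, tau=1):
    """Map a 1-D signal to points (x[n], x[n + tau], ..., x[n + (dim - 1) * tau])."""
    n_points = len(signal) - (dim - 1) * tau
    if n_points <= 0:
        raise ValueError("signal is too short for this dim and tau")
    return [
        tuple(signal[n + k * tau] for k in range(dim))
        for n in range(n_points)
    ]


# A sine with a period of 16 samples, embedded with a quarter-period delay.
# Each point is then (sin(theta), cos(theta)), so the points trace a closed
# circle that repeats every period, which is why the picture stays stable.
period = 16
sine = [math.sin(2.0 * math.pi * n / period) for n in range(4 * period)]
points = delay_embed(sine, dim=2, tau=period // 4)
```

For a signal that is not periodic, the embedded trajectory does not close on itself, so a plot of the points drifts over time instead of settling into a fixed curve.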
Wave Digital Modeling of Circuits with Multiple One-Port Nonlinearities Based on Lipschitz-Bounded Neural Networks
Neural networks have found application within the Wave Digital Filters (WDFs) framework as data-driven input-output blocks for modeling single one-port or multi-port nonlinear devices in circuit systems. However, traditional neural networks lack predictable bounds for their output derivatives, which are essential to ensure convergence when simulating circuits with multiple nonlinear elements using fixed-point iterative methods, e.g., the Scattering Iterative Method (SIM). In this study, we address this issue by employing Lipschitz-bounded neural networks for regressing nonlinear WD scattering relations of one-port nonlinearities.
Leveraging Electric Guitar Tones and Effects to Improve Robustness in Guitar Tablature Transcription Modeling
Guitar tablature transcription (GTT) aims at automatically generating symbolic representations from real solo guitar performances. Due to its applications in education and musicology, GTT has gained traction in recent years. However, GTT robustness has been limited due to the small size of available datasets. Researchers have recently used synthetic data that simulates guitar performances using pre-recorded or computer-generated tones, allowing for scalable and automatic data generation. The present study complements these efforts by demonstrating that GTT robustness can be improved by including synthetic training data created using recordings of real guitar tones played with different audio effects. We evaluate our approach on a new evaluation dataset with professional solo guitar performances that we composed and collected, featuring a wide array of tones, chords, and scales.
DDSP-SFX: Acoustically-Guided Sound Effects Generation with Differentiable Digital Signal Processing
Controlling the variations of sound effects using neural audio synthesis models has been a challenging task. Differentiable digital signal processing (DDSP) provides a lightweight solution that achieves high-quality sound synthesis while enabling deterministic acoustic attribute control by incorporating pre-processed audio features and digital synthesizers. In this research, we introduce DDSP-SFX, a model based on the DDSP architecture capable of synthesizing high-quality sound effects while enabling users to control the timbre variations easily. We integrate a transient modelling algorithm in DDSP that achieves higher objective evaluation scores and subjective ratings over impulsive signals (footsteps, gunshots). We propose a novel method that achieves frame-level timbre variation control while also allowing deterministic attribute control. We further qualitatively show the timbre transfer performance using voice as the guiding sound.
Distortion Recovery: A Two-Stage Method for Guitar Effect Removal
Removing audio effects from electric guitar recordings makes post-production and sound editing easier. An audio distortion recovery model not only improves the clarity of the guitar sounds but also opens up new opportunities for creative adjustments in mixing and mastering. While progress has been made in creating such models, previous efforts have largely focused on synthetic distortions that may be too simplistic to accurately capture the complexities seen in real-world recordings. In this paper, we tackle the task by using a dataset of guitar recordings rendered with commercial-grade audio effect VST plugins. Moreover, we introduce a novel two-stage methodology for audio distortion recovery. The idea is to process the audio signal in the Mel-spectrogram domain in the first stage, and then use a neural vocoder to generate the pristine original guitar sound from the processed Mel-spectrogram in the second stage. We report a set of experiments demonstrating the effectiveness of our approach over existing methods, through both subjective and objective evaluation metrics.
Differentiable MIMO Feedback Delay Networks for Multichannel Room Impulse Response Modeling
Recently, with the advent of new high-performance headsets and goggles, the demand for Virtual and Augmented Reality applications has experienced a steep increase. In order to coherently navigate the virtual rooms, the acoustics of the scene must be emulated in the most accurate and efficient way possible. Amongst others, Feedback Delay Networks (FDNs) have proved to be valuable tools for tackling such a task. In this article, we expand and adapt a method recently proposed for the data-driven optimization of single-input-single-output FDNs to the multiple-input-multiple-output (MIMO) case for addressing spatial/space-time processing applications. By testing our methodology on items taken from two different datasets, we show that the parameters of MIMO FDNs can be jointly optimized to match some perceptual characteristics of given multichannel room impulse responses, outperforming approaches available in the literature, and paving the way toward increasingly efficient and accurate real-time virtual room acoustics rendering.
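For readers unfamiliar with the structure being optimized, the following is a minimal sketch of a MIMO feedback delay network in its textbook form: a bank of delay lines whose outputs are mixed by a feedback matrix, driven through input gains and tapped through output gains. It is not the differentiable implementation from the paper and performs no optimization; the function name `mimo_fdn` and all parameter values are assumptions chosen for illustration.

```python
import math


def mimo_fdn(inputs, delays, feedback, in_gains, out_gains):
    """Run a basic MIMO feedback delay network.

    inputs:    list of input channels, each an equal-length list of samples
    delays:    delay-line lengths in samples (one per line, N lines)
    feedback:  N x N feedback matrix
    in_gains:  N x n_in matrix routing inputs into the delay lines
    out_gains: n_out x N matrix routing delay-line outputs to output channels
    """
    n_lines = len(delays)
    n_in = len(inputs)
    n_samples = len(inputs[0])
    buffers = [[0.0] * d for d in delays]  # circular buffers, one per line
    pos = [0] * n_lines                    # shared read/write index per line
    outputs = [[0.0] * n_samples for _ in out_gains]
    for n in range(n_samples):
        # Read the oldest sample of every delay line.
        s = [buffers[i][pos[i]] for i in range(n_lines)]
        # Mix the delay-line outputs into each output channel.
        for o, row in enumerate(out_gains):
            outputs[o][n] = sum(c * v for c, v in zip(row, s))
        # Write feedback plus input drive back into each line.
        for i in range(n_lines):
            fb = sum(feedback[i][j] * s[j] for j in range(n_lines))
            drive = sum(in_gains[i][c] * inputs[c][n] for c in range(n_in))
            buffers[i][pos[i]] = fb + drive
            pos[i] = (pos[i] + 1) % delays[i]
    return outputs


# Hypothetical 2-in, 2-out example: a scaled Hadamard feedback matrix
# (orthogonal matrix times a gain below 1) yields a decaying response.
g = 0.7 / math.sqrt(2.0)
impulse = [1.0] + [0.0] * 31
silence = [0.0] * 32
response = mimo_fdn(
    inputs=[impulse, silence],
    delays=[3, 5],
    feedback=[[g, g], [g, -g]],
    in_gains=[[1.0, 0.0], [0.0, 1.0]],
    out_gains=[[1.0, 0.0], [0.0, 1.0]],
)
```

In a data-driven setting such as the one the abstract describes, the delays and gain matrices would be the quantities fitted to measured multichannel room impulse responses; here they are fixed by hand.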
Audio-Visual Talker Localization in Video for Spatial Sound Reproduction
Object-based audio production requires the positional metadata to be defined for each point-source object, including the key elements in the foreground of the sound scene. In many media production use cases, both cameras and microphones are employed to make recordings, and the human voice is often a key element. In this research, we detect and locate the active speaker in the video, facilitating the automatic extraction of the positional metadata of the talker relative to the camera’s reference frame. With the integration of the visual modality, this study expands upon our previous investigation focused solely on audio-based active speaker detection and localization. Our experiments compare conventional audio-visual approaches for active speaker detection that leverage monaural audio, our previous audio-only method that leverages multichannel recordings from a microphone array, and a novel audio-visual approach integrating vision and multichannel audio. We found that the two modalities play complementary roles. Multichannel audio, overcoming the problem of visual occlusions, provides a double-digit reduction in detection error compared to audio-visual methods with single-channel audio. The combination of multichannel audio and vision further enhances spatial accuracy, leading to a four-percentage-point increase in F1 score on the Tragic Talkers dataset. Future investigations will assess the robustness of the model in noisy and highly reverberant environments, as well as tackle the problem of off-screen speakers.
DDSP-Based Neural Waveform Synthesis of Polyphonic Guitar Performance From String-Wise MIDI Input
We explore the use of neural synthesis for acoustic guitar from string-wise MIDI input. We propose four different systems and compare them with both objective metrics and subjective evaluation against natural audio and a sample-based baseline. We iteratively develop these four systems by making various considerations on the architecture and intermediate tasks, such as predicting pitch and loudness control features. We find that formulating the control feature prediction task as a classification task rather than a regression task yields better results. Furthermore, we find that our simplest proposed system, which directly predicts synthesis parameters from MIDI input, performs the best out of the four proposed systems. Audio examples and code are available.