Modulation Extraction for LFO-driven Audio Effects
Low-frequency oscillator (LFO) driven audio effects, such as phaser, flanger, and chorus, modify an input signal using time-varying filters and delays, resulting in characteristic sweeping or widening effects. It has been shown that these effects can be modeled using neural networks when conditioned with the ground truth LFO signal. However, in most cases, the LFO signal is not accessible and measurement from the audio signal is nontrivial, hindering the modeling process. To address this, we propose a framework capable of extracting arbitrary LFO signals from processed audio across multiple digital audio effects, parameter settings, and instrument configurations. Since our system imposes no restrictions on the LFO signal shape, we demonstrate its ability to extract quasiperiodic, combined, and distorted modulation signals that are relevant to effect modeling. Furthermore, we show how coupling the extraction model with a simple processing network enables training of end-to-end black-box models of unseen analog or digital LFO-driven audio effects using only dry and wet audio pairs, overcoming the need to access the audio effect or its internal LFO signal. We make our code available and provide the trained audio effect models in a real-time VST plugin.
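As a rough illustration of the framework, the sketch below (PyTorch) pairs a hypothetical strided-convolution extractor, which maps wet audio to a frame-rate modulation signal, with a small processing network conditioned on that signal. All module names, layer sizes, and the loss are illustrative assumptions, not the paper's architecture.

```python
# Minimal sketch of the extraction-plus-processor idea: an extractor maps wet
# audio to a frame-rate LFO estimate, and a small conditioned processor maps
# dry audio (plus the upsampled LFO) to wet audio. Sizes are illustrative.
import torch
import torch.nn as nn

class LFOExtractor(nn.Module):
    """Strided 1-D conv encoder: wet audio -> frame-rate LFO estimate."""
    def __init__(self, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, hidden, kernel_size=64, stride=16, padding=24), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=8, stride=4, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, 1, kernel_size=3, padding=1), nn.Tanh(),  # LFO in [-1, 1]
        )
    def forward(self, wet):            # wet: (batch, 1, samples)
        return self.net(wet)           # (batch, 1, frames)

class ConditionedProcessor(nn.Module):
    """Toy black-box effect model: dry audio + LFO -> wet audio."""
    def __init__(self, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(2, hidden, kernel_size=9, padding=4), nn.Tanh(),
            nn.Conv1d(hidden, 1, kernel_size=9, padding=4),
        )
    def forward(self, dry, lfo):
        lfo_up = nn.functional.interpolate(lfo, size=dry.shape[-1], mode="linear")
        return self.net(torch.cat([dry, lfo_up], dim=1))

dry = torch.randn(4, 1, 16384)
wet = torch.randn(4, 1, 16384)
lfo = LFOExtractor()(wet)                # extract modulation from the wet audio
pred = ConditionedProcessor()(dry, lfo)  # end-to-end model trained on (dry, wet) pairs
loss = torch.mean((pred - wet) ** 2)     # e.g., a simple reconstruction loss
```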
Dynamic Pitch Warping for Expressive Vocal Retuning
This work introduces the use of the Dynamic Pitch Warping (DPW) method for automatic pitch correction of singing voice audio signals. DPW is designed to dynamically tune any pitch trajectory to a predefined scale while preserving its expressive ornamentation. DPW has three degrees of freedom for modifying the fundamental frequency (f0) signal: the detection interval, the critical time, and the transition time. Together, these parameters define a pitch velocity condition that triggers an adaptive correction of the pitch trajectory (pitch warping). We compared our approach to Antares Autotune (the most commonly used software brand, abbreviated as ATA in this article). The pitch correction in ATA has two degrees of freedom: a triggering threshold (flextune) and a transition time (retune speed). The pitch trajectories we compare were extracted from audio signals autotuned in ATA and from the DPW algorithm applied to the f0 of the input audio tracks. We specifically studied pitch correction for three typical types of f0 curves: staircase, vibrato, and free path. For each case, we measured the proximity of the corrected pitch trajectories to the original ones, finding that the DPW pitch correction method better preserves vibrato while keeping the free path of the f0. In contrast, ATA is more effective at generating staircase curves but fails for vibratos that are not small and for free-path curves. We have also implemented an off-line automatic pitch tuner using DPW.
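The following is a minimal sketch of a DPW-style correction pass, assuming a voiced frame-rate f0 track in Hz. The three controls mirror the paper's degrees of freedom (detection interval, critical time, transition time), but the trigger rule, scale snapping, and velocity threshold used here are simplifying assumptions rather than the published algorithm.

```python
# Hedged sketch of dynamic pitch warping: when the pitch velocity stays below
# a threshold long enough (critical time), an additive offset glides the
# trajectory toward the nearest scale note over the transition time, leaving
# fast ornaments (vibrato, free paths) riding on top of the offset.
import numpy as np

def nearest_scale_cents(cents, scale=(0, 2, 4, 5, 7, 9, 11)):
    """Snap a pitch value (cents above a reference) to the nearest scale degree."""
    octave, rem = divmod(cents, 1200.0)
    candidates = [d * 100.0 for d in scale] + [1200.0]  # include the octave wrap
    return octave * 1200.0 + min(candidates, key=lambda c: abs(c - rem))

def dpw(f0_hz, frame_rate, detect_s=0.05, critical_s=0.1, transition_s=0.05,
        vel_thresh=2000.0, f_ref=440.0):
    """Warp a voiced f0 trajectory (Hz) toward a scale; frame_rate in frames/s."""
    cents = 1200.0 * np.log2(np.asarray(f0_hz) / f_ref)
    out = cents.copy()
    n_detect = max(1, int(detect_s * frame_rate))
    n_crit = max(1, int(critical_s * frame_rate))
    n_trans = max(1, int(transition_s * frame_rate))
    stable, offset, target_off, ramp = 0, 0.0, 0.0, 0
    for i in range(len(cents)):
        j = max(0, i - n_detect)
        vel = abs(cents[i] - cents[j]) * frame_rate / max(1, i - j)  # cents/s
        stable = stable + 1 if vel < vel_thresh else 0
        if stable == n_crit:                 # velocity condition held long enough
            local = float(np.mean(cents[j:i + 1]))
            target_off = nearest_scale_cents(local) - local
            ramp = n_trans                   # start gliding toward the target
        if ramp > 0:                         # additive offset preserves ornaments
            offset += (target_off - offset) / ramp
            ramp -= 1
        out[i] = cents[i] + offset
    return f_ref * 2.0 ** (out / 1200.0)
```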
Towards High Sampling Rate Sound Synthesis on FPGA
This “Late Breaking Results” paper presents an ongoing project aiming to provide an accessible and easy-to-use platform for high sampling rate real-time audio Digital Signal Processing (DSP). The current version can operate in the megahertz range, and we aim to achieve sampling rates as high as 20 MHz in the near future. It relies on the Syfala compiler, which can be used to program Field Programmable Gate Array (FPGA) platforms at a high level using the FAUST programming language. In our system, the audio DAC is implemented directly on the FPGA chip, providing exceptional performance in terms of audio latency as well. After giving an overview of the state of the art in this field, we describe how this tool works and present ongoing and future developments.
P-RAVE: Improving RAVE through pitch conditioning and more with application to singing voice conversion
In this paper, we introduce means of improving the fidelity and controllability of the RAVE generative audio model by factorizing pitch and other features. We accomplish this primarily by creating a multi-band excitation signal capturing pitch and/or loudness information and by using it to FiLM-condition the RAVE generator. To further improve fidelity in the singing voice application explored here, we also consider concatenating a supervised phonetic encoding to its latent representation. An ablation analysis highlights the performance gains of our incremental improvements relative to the baseline RAVE model. As our primary enhancement adds a stable pitch conditioning mechanism to the RAVE model, we simply call our method P-RAVE.
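Since the core enhancement is FiLM conditioning of the generator, the sketch below shows the generic FiLM mechanism in PyTorch: features derived from an excitation signal produce per-channel scale and shift terms that modulate hidden activations. Channel counts and the excitation features themselves are placeholders, not P-RAVE's exact layers.

```python
# Generic FiLM (feature-wise linear modulation) conditioning sketch: a 1x1
# convolution maps conditioning features to per-channel gamma/beta, which
# scale and shift a generator block's hidden activations.
import torch
import torch.nn as nn

class FiLM(nn.Module):
    def __init__(self, cond_channels, feature_channels):
        super().__init__()
        self.to_gamma_beta = nn.Conv1d(cond_channels, 2 * feature_channels, 1)
    def forward(self, h, cond):
        # h: (B, C, T) hidden features; cond: (B, Cc, T) excitation features
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=1)
        return gamma * h + beta

# Usage: condition one generator block on pitch/loudness excitation features.
h = torch.randn(2, 64, 512)          # hidden activations of a generator block
excitation = torch.randn(2, 8, 512)  # e.g., multi-band pitch excitation features
film = FiLM(cond_channels=8, feature_channels=64)
h_mod = film(h, excitation)          # FiLM-modulated activations, same shape
```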
Designing a Library for Generative Audio in Unity
This paper overviews URALi, a library designed to add generative sound synthesis capabilities to Unity. This project is directed in particular towards audiovisual artists who are keen on working with algorithmic systems in Unity but cannot find native solutions for procedural sound synthesis to pair with their visual and control counterparts. After overviewing the audio options available in Unity, this paper reports on the functioning and architecture of the library, which is an ongoing project.
Interpretable timbre synthesis using variational autoencoders regularized on timbre descriptors
Controllable timbre synthesis has been a subject of research for several decades, and deep neural networks have been the most successful in this area. Deep generative models such as Variational Autoencoders (VAEs) have the ability to generate a high-level representation of audio while providing a structured latent space. Despite their advantages, the interpretability of these latent spaces in terms of human perception is often limited. To address this limitation and enhance the control over timbre generation, we propose a regularized VAE-based latent space that incorporates timbre descriptors. Moreover, we suggest a more concise representation of sound by utilizing its harmonic content, in order to minimize the dimensionality of the latent space.
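One way to realize such a regularized latent space, in the spirit of attribute-regularized VAEs, is to add a loss term that ties the batch-wise ordering of one latent dimension to that of a timbre descriptor such as spectral centroid. The PyTorch sketch below is illustrative and not necessarily the paper's exact formulation.

```python
# Sketch of a descriptor-regularized VAE loss: alongside the usual
# reconstruction + KL terms, one latent dimension is encouraged to follow the
# pairwise ordering of a timbre descriptor across the batch.
import torch
import torch.nn.functional as F

def descriptor_regularization(z_dim, descriptor, delta=1.0):
    """Penalize mismatch between pairwise orderings of a latent dim and a descriptor."""
    dz = z_dim.unsqueeze(0) - z_dim.unsqueeze(1)             # (B, B) latent differences
    dd = descriptor.unsqueeze(0) - descriptor.unsqueeze(1)   # (B, B) descriptor differences
    return F.l1_loss(torch.tanh(delta * dz), torch.sign(dd))

def vae_loss(x, x_hat, mu, logvar, centroid, reg_dim=0, beta=1.0, gamma=10.0):
    recon = F.mse_loss(x_hat, x)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    reg = descriptor_regularization(mu[:, reg_dim], centroid)
    return recon + beta * kl + gamma * reg
```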
What you hear is what you see: Audio quality from Image Quality Metrics
In this study, we investigate the feasibility of utilizing state-of-the-art perceptual image metrics for evaluating audio signals by representing them as spectrograms. The encouraging outcome of the proposed approach is based on the similarity between the neural mechanisms in the auditory and visual pathways. Furthermore, we customise one of the metrics, which has a psychoacoustically plausible architecture, to account for the peculiarities of sound signals. We evaluate the effectiveness of our proposed metric and several baseline metrics using a music dataset, with promising results in terms of the correlation between the metrics and the perceived quality of audio as rated by human evaluators.
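As a minimal sketch of the evaluation pipeline, the snippet below renders two signals as normalized log-mel spectrogram "images" and scores them with SSIM as a stand-in for the perceptual image metrics studied; librosa and scikit-image are assumed to be available, and all parameters are illustrative.

```python
# Score audio quality by comparing spectrogram "images" with an image metric.
import numpy as np
import librosa
from skimage.metrics import structural_similarity as ssim

def spectrogram_image(audio, sr=44100):
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=128)
    img = librosa.power_to_db(mel, ref=np.max)                  # log-compressed magnitudes
    return (img - img.min()) / (img.max() - img.min() + 1e-9)   # normalize to [0, 1]

def audio_quality_score(reference, degraded, sr=44100):
    ref_img = spectrogram_image(reference, sr)
    deg_img = spectrogram_image(degraded, sr)
    return ssim(ref_img, deg_img, data_range=1.0)               # higher = closer to reference
```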
Dynamic Stochastic Wavetable Synthesis
Dynamic Stochastic Synthesis (DSS) is a direct digital synthesis method invented by composer Iannis Xenakis and notably employed in his 1991 composition GENDY3. In its original conception, DSS generates periodic waves by linear interpolation between a set of breakpoints in amplitude–time space. The breakpoints change position each period, displaced by random walks via high-level parameters that induce various behaviors and timbres along the pitch–noise continuum. The following paper proposes Dynamic Stochastic Wavetable Synthesis as a modification and generalization of DSS that enables its application to table-lookup oscillators, allowing arbitrary sample data to become the basis of a DSS process. We describe the considerations affecting the development of such an algorithm and offer a real-time implementation informed by the analysis.
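A hedged NumPy reading of the idea: classic DSS linearly interpolates breakpoints whose amplitudes are random-walked each period; the wavetable generalization is sketched here by adding the walked breakpoint curve onto a table lookup, so arbitrary sample data seeds the process. This simplification walks amplitudes only (original DSS also walks breakpoint times) and is not the paper's exact algorithm.

```python
# DSS-style oscillator generalized to a wavetable: per period, random-walk
# breakpoint amplitudes, interpolate them into an offset curve, and add that
# curve onto one cycle of the table.
import numpy as np

rng = np.random.default_rng(0)

def dss_wavetable(table, n_periods=200, n_breaks=12, samples_per_seg=32,
                  amp_step=0.02, amp_bound=0.5):
    amps = np.zeros(n_breaks)                   # breakpoint amplitudes (random-walked)
    period_len = n_breaks * samples_per_seg
    out = []
    for _ in range(n_periods):
        # Random-walk each breakpoint, clamping at the amplitude bounds.
        amps = np.clip(amps + rng.uniform(-amp_step, amp_step, n_breaks),
                       -amp_bound, amp_bound)
        # Linearly interpolate breakpoints across one period (classic DSS core).
        xs = np.arange(period_len) / samples_per_seg
        offset = np.interp(xs, np.arange(n_breaks + 1),
                           np.append(amps, amps[0]))            # wrap for continuity
        # Wavetable generalization: read the table once per period, add offsets.
        phase = np.linspace(0.0, 1.0, period_len, endpoint=False)
        wave = np.interp(phase * len(table), np.arange(len(table) + 1),
                         np.append(table, table[0]))
        out.append(wave + offset)
    return np.concatenate(out)

table = np.sin(2 * np.pi * np.arange(2048) / 2048)   # any sample data can seed the process
signal = dss_wavetable(table)
```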
Expressive Piano Performance Rendering from Unpaired Data
Recent advances in data-driven expressive performance rendering have enabled automatic models to reproduce the characteristics and variability of human performances of musical compositions. However, these models need to be trained with aligned pairs of scores and performances, and they notably rely on score-specific markings, which limits their scope of application. This work tackles the piano performance rendering task in a low-informed setting, considering only the score note information and using no aligned data. The proposed model relies on adversarial training in which the basic score note properties are modified in order to reproduce the expressive qualities contained in a dataset of real performances. First results for unaligned score-to-performance rendering are presented and evaluated through a listening test. While the interpretation quality is not on par with highly supervised methods and human renditions, our method shows promising results for transferring realistic expressivity into scores.
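A toy sketch of the unpaired adversarial setup: a generator predicts deviations to basic note properties, and a discriminator is trained to distinguish the result from notes drawn from real performances, so no aligned score/performance pairs are required. The note encoding and both architectures below are placeholders, not the paper's model.

```python
# Adversarial training on unpaired data: the generator adds expressive
# deviations to score notes; the discriminator critiques realism.
import torch
import torch.nn as nn

note_dim = 3  # e.g., (onset shift, duration scale, velocity); illustrative encoding

generator = nn.Sequential(nn.Linear(note_dim, 64), nn.ReLU(),
                          nn.Linear(64, note_dim))      # predicts expressive deviations
discriminator = nn.Sequential(nn.Linear(note_dim, 64), nn.ReLU(),
                              nn.Linear(64, 1))         # real-performance critic
bce = nn.BCEWithLogitsLoss()

score_notes = torch.randn(128, note_dim)   # notes from scores (no aligned performance)
real_perf = torch.randn(128, note_dim)     # notes from unrelated real performances

rendered = score_notes + generator(score_notes)         # deviations applied to the score
d_loss = bce(discriminator(real_perf), torch.ones(128, 1)) + \
         bce(discriminator(rendered.detach()), torch.zeros(128, 1))
g_loss = bce(discriminator(rendered), torch.ones(128, 1))  # fool the critic
```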
Audio Effect Chain Estimation and Dry Signal Recovery From Multi-Effect-Processed Musical Signals
In this paper, we propose a method that addresses a novel task: audio effect (AFX) chain estimation and dry signal recovery. AFXs are indispensable in modern sound design workflows. Sound engineers often cascade different AFXs (as an AFX chain) to achieve their desired soundscapes. Given a multi-AFX-applied solo instrument performance (wet signal), our method can automatically estimate the applied AFX chain and recover the unprocessed dry signal, whereas previous research addresses only one of these problems. The estimated chain is useful for novice engineers learning practical uses of AFXs, and the recovered signal can be reused with a different AFX chain. To solve this task, we first develop a deep neural network model that estimates the last-applied AFX and undoes it. We then apply the same model iteratively, one AFX at a time, to estimate the AFX chain and eventually recover the dry signal from the wet signal. Our experiments on guitar phrase recordings with various AFX chains demonstrate the validity of our method for both AFX-chain estimation and dry signal recovery. We also confirm that the input wet signal can be reproduced by applying the estimated AFX chain to the recovered dry signal.
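The iterative estimate-and-undo loop can be summarized in a few lines of Python; `classify_last_afx` and `undo` below are hypothetical stand-ins for the paper's neural network, not a released API.

```python
# Iteratively peel AFXs off the wet signal, last-applied first, until the
# classifier decides the signal is dry; report the chain in application order.
def recover_dry(wet_signal, classify_last_afx, undo, max_steps=10):
    chain = []
    signal = wet_signal
    for _ in range(max_steps):
        afx = classify_last_afx(signal)     # which AFX was applied last?
        if afx == "dry":                    # no effect detected: recovery done
            break
        signal = undo(signal, afx)          # remove that AFX from the signal
        chain.append(afx)
    chain.reverse()                         # last-removed was first-applied
    return chain, signal
```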