DiffVox: A Differentiable Model for Capturing and Analysing Vocal Effects Distributions
This study introduces DiffVox, a novel and interpretable model for matching vocal effects in music production. DiffVox, short for “Differentiable Vocal Fx”, integrates parametric equalisation, dynamic range control, delay, and reverb with efficient differentiable implementations that enable gradient-based optimisation for parameter estimation. Vocal presets are retrieved from two datasets comprising 70 tracks from MedleyDB and 365 tracks from a private collection. Analysis of parameter correlations reveals strong relationships between effects and parameters, such as the high-pass and low-shelf filters often working together to shape the low end, and the delay time correlating with the intensity of the delayed signal. Principal component analysis reveals connections to McAdams’ timbre dimensions: the most important component modulates perceived spaciousness, while the secondary components influence spectral brightness. Statistical testing confirms the non-Gaussian nature of the parameter distribution, highlighting the complexity of the vocal effects space. These initial findings on the parameter distributions lay the foundation for future research in vocal effects modelling and automatic mixing.
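As a toy illustration of the gradient-based parameter estimation the abstract describes, the sketch below fits the parameters of a deliberately tiny differentiable effect chain (a gain into a one-pole low-pass) to a target rendering with PyTorch autograd. The chain, parameter names, and values are illustrative placeholders, not DiffVox's actual effects or API.

```python
# Minimal sketch: recover effect parameters by gradient descent on a
# matching loss, in the spirit of differentiable effect modelling.
import torch

def one_pole_lowpass(x, a):
    # y[n] = (1 - a) * x[n] + a * y[n - 1]; a slow Python loop, kept
    # explicit for clarity (real differentiable effects use faster recursions).
    prev = torch.zeros(())
    out = []
    for n in range(x.shape[0]):
        prev = (1 - a) * x[n] + a * prev
        out.append(prev)
    return torch.stack(out)

torch.manual_seed(0)
x = torch.randn(256)                                   # dry input signal
target = one_pole_lowpass(0.5 * x, torch.tensor(0.8))  # "preset" to recover

gain = torch.tensor(1.0, requires_grad=True)
raw_a = torch.tensor(0.0, requires_grad=True)          # sigmoid keeps the pole in (0, 1)
opt = torch.optim.Adam([gain, raw_a], lr=0.05)

for step in range(300):
    opt.zero_grad()
    y = one_pole_lowpass(gain * x, torch.sigmoid(raw_a))
    loss = torch.mean((y - target) ** 2)
    loss.backward()
    opt.step()

print(f"gain ~ {gain.item():.3f} (true 0.5), "
      f"a ~ {torch.sigmoid(raw_a).item():.3f} (true 0.8)")
```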
Improving intelligibility prediction under informational masking using an auditory saliency model
The reduction of speech intelligibility in noise is usually dominated by energetic masking (EM) and informational masking (IM). Most state-of-the-art objective intelligibility measures (OIMs) estimate intelligibility by quantifying EM; few model the effect of IM in detail. In this study, an auditory saliency model, intended to measure the probability of a source capturing auditory attention in a bottom-up process, was integrated into an OIM to improve intelligibility prediction under IM. While EM is accounted for by the original OIM, IM is assumed to arise from the listener’s attention switching between the target and competing sounds in the auditory scene. The performance of the proposed method was evaluated alongside three reference OIMs by comparing model predictions to listener word recognition rates for different noise maskers, some of which introduce IM. The results show that the predictive accuracy of the proposed method is as good as the best reported in the literature, while additionally offering a physiologically plausible account of both IM and EM.
Modelling Experts’ Decisions on Assigning Narrative Importances of Objects in a Radio Drama Mix
There is an increasing number of consumers of broadcast audio who have some degree of hearing impairment. One of the methods developed for tackling this issue consists of creating customizable object-based audio mixes where users can attenuate parts of the mix using a simple complexity parameter. The method relies on the mixing engineer classifying audio objects in the mix according to their narrative importance. This paper focuses on automating that process. Individual tracks are classified based on their music, speech, or sound-effect content. The decisions for assigning narrative importance to each segment of a radio drama mix are then modelled using mixture distributions. Finally, the learned decisions and resulting mixes are evaluated using Short-Time Objective Intelligibility, with reference to the narrative importance selections made by the original producer. This approach has applications in providing customizable mixes for legacy content, or for automatically generated media content where an engineer cannot intervene.
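As a sketch of the mixture-distribution modelling step, the following fits one Gaussian mixture per narrative-importance class and assigns a new segment to the class whose mixture gives it the highest likelihood. The two-dimensional features, class names, and data are synthetic placeholders; the paper's actual features come from classifying each segment's music, speech, and sound-effect content.

```python
# Minimal sketch: model per-class importance decisions with Gaussian mixtures.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy 2-D features per segment (e.g. speech-likeness, loudness), one set
# per narrative-importance class assigned by the producer.
features_per_class = {
    "essential": rng.normal([0.8, 0.6], 0.1, size=(50, 2)),
    "optional": rng.normal([0.2, 0.3], 0.1, size=(50, 2)),
}
models = {c: GaussianMixture(n_components=2, random_state=0).fit(x)
          for c, x in features_per_class.items()}

# Classify a new segment by maximum average log-likelihood.
segment = np.array([[0.75, 0.55]])
scores = {c: m.score(segment) for c, m in models.items()}
print(max(scores, key=scores.get))
```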
Towards Efficient Emulation of Nonlinear Analog Circuits for Audio Using Constraint Stabilization and Convex Quadratic Programming
This paper introduces a computationally efficient method for emulating nonlinear analog audio circuits by combining state-space representations, constraint stabilization, and convex quadratic programming (QP). Unlike traditional virtual analog (VA) modeling approaches or computationally demanding SPICE-based simulations, our approach reformulates the nonlinear differential-algebraic equation (DAE) systems that arise from analog circuit analysis into numerically stable optimization problems. The proposed method efficiently addresses the numerical challenges posed by nonlinear algebraic constraints via constraint stabilization techniques, significantly enhancing robustness and stability and making the method suitable for real-time simulation. A canonical diode clipper circuit is presented as a test case, demonstrating that our method achieves emulation that is both accurate and faster than conventional state-space methods. Furthermore, our method performs very well even at substantially lower sampling rates. Preliminary numerical experiments confirm that the proposed approach offers improved numerical stability and real-time feasibility, positioning it as a practical solution for high-fidelity audio applications.
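For context, the sketch below implements the conventional state-space baseline that such methods are compared against, not the paper's QP formulation: a first-order diode clipper discretised with backward Euler and solved per sample by Newton iteration. Component values and iteration limits are illustrative.

```python
# Conventional state-space diode clipper baseline (backward Euler + Newton).
import numpy as np

R, C = 2.2e3, 10e-9          # series resistor and capacitor
Is, Vt = 2.52e-9, 26e-3      # diode saturation current and thermal voltage
fs = 48_000
h = 1.0 / fs

def f(v, vin):
    # dv/dt for the antiparallel-diode clipper state equation
    return (vin - v) / (R * C) - (2 * Is / C) * np.sinh(v / Vt)

def df_dv(v):
    return -1.0 / (R * C) - (2 * Is / (C * Vt)) * np.cosh(v / Vt)

def process(x):
    y = np.zeros_like(x)
    v = 0.0
    for n, vin in enumerate(x):
        # Backward Euler: solve g(v) = v - v_prev - h * f(v, vin) = 0,
        # warm-started from the previous sample's solution.
        v_new = v
        for _ in range(20):
            g = v_new - v - h * f(v_new, vin)
            v_new -= g / (1.0 - h * df_dv(v_new))
            if abs(g) < 1e-9:
                break
        v = v_new
        y[n] = v
    return y

t = np.arange(4800) / fs
print(process(2.0 * np.sin(2 * np.pi * 220 * t))[:5])
```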
TorchFX: A Modern Approach to Audio DSP with PyTorch and GPU Acceleration
The increasing complexity and real-time processing demands of audio signals require optimized algorithms that exploit the computational power of Graphics Processing Units (GPUs). Existing Digital Signal Processing (DSP) libraries often do not provide the necessary efficiency and flexibility, particularly for integration with Artificial Intelligence (AI) models. In response, we introduce TorchFX: a GPU-accelerated Python library for DSP, engineered to facilitate sophisticated audio signal processing. Built on the PyTorch framework, TorchFX offers an object-oriented interface similar to torchaudio but enhances functionality with a novel pipe operator for intuitive filter chaining. The library provides a comprehensive suite of Finite Impulse Response (FIR) and Infinite Impulse Response (IIR) filters, with a focus on multichannel audio, thereby facilitating the integration of DSP- and AI-based approaches. Our benchmarking results demonstrate significant efficiency gains over traditional libraries such as SciPy, particularly in multichannel contexts. While there are current limitations in GPU compatibility, ongoing developments promise broader support and real-time processing capabilities. TorchFX aims to become a useful tool for the community, contributing to innovation in GPU-accelerated DSP. TorchFX is publicly available on GitHub at https://github.com/matteospanio/torchfx.
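To make the pipe-operator idea concrete, here is a minimal sketch of how such chaining can be built on torch and torchaudio. The FX wrapper and the chain below are illustrative only and do not reproduce TorchFX's actual API; see the repository above for the real interface.

```python
# Illustrative pipe-style filter chaining on top of torchaudio primitives.
import torch
import torchaudio.functional as F

class FX:
    """Wraps a tensor-to-tensor effect and overloads | for chaining."""
    def __init__(self, fn):
        self.fn = fn

    def __or__(self, other):
        # left | right applies the left effect first, then the right.
        return FX(lambda x: other.fn(self.fn(x)))

    def __call__(self, x):
        return self.fn(x)

sr = 44_100
highpass = FX(lambda x: F.highpass_biquad(x, sr, cutoff_freq=80.0))
lowpass = FX(lambda x: F.lowpass_biquad(x, sr, cutoff_freq=8_000.0))
gain = FX(lambda x: F.gain(x, gain_db=-3.0))

chain = highpass | lowpass | gain        # reads left to right
y = chain(torch.randn(2, sr))            # one second of stereo noise
print(y.shape)
```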
Towards an Invertible Rhythm Representation
This paper investigates the development of a rhythm representation of music audio signals that (i) is able to tackle rhythm-related tasks and (ii) is invertible, i.e. suitable for reconstructing audio from it with the corresponding rhythm content preserved. A conventional front-end processing scheme is applied to the audio signal to extract time-varying characteristics (accent features) of the signal. Next, a periodicity analysis method is proposed that is capable of reconstructing the accent features. A network consisting of Restricted Boltzmann Machines is then applied to the periodicity function to learn a latent representation. This latent representation is finally used to tackle two distinct rhythm tasks, namely dance style classification and meter estimation. The results are promising for both input signal reconstruction and rhythm classification performance. Moreover, the proposed method is extended to generate random samples from the corresponding classes.
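The sketch below illustrates a conventional front end of the kind the abstract describes: a spectral-flux accent curve followed by a simple autocorrelation-based periodicity analysis. The invertible periodicity method and the RBM stack in the paper go well beyond this illustration, and all frame sizes and signals here are toy values.

```python
# Accent feature (spectral flux) and a simple periodicity analysis.
import numpy as np

def accent_feature(x, n_fft=1024, hop=512):
    # Half-wave rectified spectral flux as a time-varying accent curve.
    frames = np.lib.stride_tricks.sliding_window_view(x, n_fft)[::hop]
    mag = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1))
    return np.maximum(np.diff(mag, axis=0), 0.0).sum(axis=1)

def periodicity(accent):
    a = accent - accent.mean()
    ac = np.correlate(a, a, mode="full")[len(a) - 1:]
    return ac / (ac[0] + 1e-12)          # normalised autocorrelation

fs = 22_050
x = 0.05 * np.random.randn(fs * 4)       # four seconds of noise
x[::fs // 2] += 1.0                      # clicks at 120 BPM (every 0.5 s)
p = periodicity(accent_feature(x))
lag = np.argmax(p[5:]) + 5               # skip the zero-lag peak region
print(f"dominant lag: {lag} hops")       # ~21 hops = 0.5 s at this frame rate
```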
New Techniques and Effects in Model-based Sound Synthesis
Physical modeling and model-based sound synthesis have recently been among the most active topics of computer music and audio research. In the modeling approach one typically tries to simulate and duplicate the most prominent sound generation properties of the acoustic musical instrument under study. If desired, the models developed may then be modified in order to create sounds that are not common or even possible from physically realizable instruments. In addition to physically related principles it is possible to combine physical models with other synthesis and signal processing methods to realize hybrid modeling techniques.
This article is written as an overview of some recent results in model-based sound synthesis and related signal processing techniques. The focus is on modeling and synthesizing plucked string sounds, although the techniques may find much more widespread application. First, as a background, an advanced linear model of the acoustic guitar is discussed along with model control principles. Then the methodology to include inherent nonlinearities and time-varying features is introduced. Examples of string instrument nonlinearities are studied in the context of two specific instruments, the kantele and the tanbur, which exhibit interesting nonlinear effects.
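As a minimal, self-contained taste of plucked-string modelling, the sketch below implements the classic Karplus-Strong loop: a noise burst circulating in a delay line with a lossy averaging filter. This is a far simpler relative of the guitar, kantele, and tanbur models the article discusses; the pitch, decay, and duration values are illustrative.

```python
# Classic Karplus-Strong plucked-string synthesis.
import numpy as np

def pluck(f0, fs=44_100, dur=1.0, decay=0.996):
    n = int(fs / f0)                     # delay-line length sets the pitch
    buf = np.random.uniform(-1, 1, n)    # noise burst models the pluck
    out = np.empty(int(fs * dur))
    for i in range(out.size):
        out[i] = buf[i % n]
        # Loop filter: two-point average with a loss factor, standing in
        # for the string's frequency-dependent losses.
        buf[i % n] = decay * 0.5 * (buf[i % n] + buf[(i + 1) % n])
    return out

tone = pluck(196.0)                      # G3
print(tone[:5])
```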
Chebyshev Model and Synchronized Swept Sine Method in Nonlinear Audio Effect Modeling
A method for the identification of nonlinear systems based on an exponential swept-sine input signal was proposed by Farina ten years ago. This method has recently been modified for the purpose of nonlinear model estimation using a synchronized swept-sine signal, allowing a robust and fast one-pass analysis and identification of the unknown nonlinear system under test. In this paper, this modified method is applied with a Chebyshev polynomial decomposition. The combination of the synchronized swept-sine method and Chebyshev polynomials leads to a nonlinear model consisting of several parallel branches, each branch containing a nonlinear Chebyshev polynomial followed by a linear filter. The method is tested on an overdrive effect pedal to simulate an analog nonlinear effect in the digital domain.
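The sketch below generates a synchronized exponential swept sine of the kind the abstract builds on: the sweep-rate parameter is rounded so that the sweep's phase stays aligned across harmonics. The frequency range and nominal duration are illustrative values.

```python
# Synchronized exponential swept-sine generation for nonlinear identification.
import numpy as np

def synchronized_sweep(f1=20.0, f2=20_000.0, T=5.0, fs=48_000):
    # Round L so that f1 * L is an integer -> harmonic phases stay aligned.
    L = np.round(f1 * T / np.log(f2 / f1)) / f1
    T_actual = L * np.log(f2 / f1)       # duration implied by the rounded L
    t = np.arange(int(T_actual * fs)) / fs
    return np.sin(2 * np.pi * f1 * L * (np.exp(t / L) - 1.0)), fs

x, fs = synchronized_sweep()
print(x.shape)
```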
Improving Singing Language Identification through i-Vector Extraction
Automatic language identification for singing is a topic that has received little attention in recent years. Possible application scenarios include searching for musical pieces in a certain language, improving similarity search algorithms for music, and improving regional music classification and genre classification. It could also serve to mitigate the "glass ceiling" effect. Most existing approaches employ PPRLM processing (Parallel Phone Recognition followed by Language Modeling). We present a new approach to singing language identification. PLP, MFCC, and SDC features are extracted from audio files and then passed through an i-vector extractor, which reduces the data for each sample to a single 450-dimensional feature vector. We then train neural networks and support vector machines on these feature vectors. Because of the reduced data, the training process is very fast. The results are comparable to the state of the art, reaching accuracies of 83% on a large speech corpus and 78% on a cappella singing. In contrast to PPRLM approaches, our algorithm does not require phoneme-wise annotations and is easier to implement.
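As a sketch of the classification stage only, the following trains an SVM on fixed-length utterance vectors, assuming i-vectors have already been extracted (here replaced by random placeholders of the paper's stated 450 dimensions, for three hypothetical languages).

```python
# SVM classification of precomputed, fixed-length i-vectors (placeholders).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_per_lang, dim = 200, 450
ivectors = np.vstack([rng.normal(mu, 1.0, size=(n_per_lang, dim))
                      for mu in (0.0, 0.3, -0.3)])     # 3 toy "languages"
labels = np.repeat(["en", "de", "es"], n_per_lang)

X_tr, X_te, y_tr, y_te = train_test_split(ivectors, labels, random_state=0)
clf = SVC(kernel="linear").fit(X_tr, y_tr)
print(f"accuracy: {clf.score(X_te, y_te):.2f}")
```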
Score-level timbre transformations of violin sounds
The ability of a sound synthesizer to provide realistic sounds depends to a great extent on the availability of expressive controls. One of the most important expressive features a user of a synthesizer would desire to control is timbre. Timbre is a complex concept related to many musical indications in a score, such as dynamics, accents, hand position, string played, or even indications referring to timbre itself. Musical indications are in turn related to low-level performance controls such as bow velocity or bow force. With the help of a data acquisition system able to record sound synchronized to performance controls and aligned to the performed score, and by means of statistical analysis, we are able to model the interrelations among sound (timbre), controls, and musical score indications. In this paper we present a procedure for score-controlled timbre transformations of violin sounds within a sample-based synthesizer. Given a sound sample and its trajectory of performance controls: 1) the controls trajectory is transformed according to the score indications; 2) a new timbre corresponding to the transformed trajectory is predicted by a timbre model that relates timbre to performance controls; and 3) the timbre of the original sound is transformed by applying a time-varying filter, calculated frame by frame as the difference between the original and predicted envelopes.
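The sketch below illustrates step 3 only: applying a time-varying correction filter computed frame by frame as the ratio between a predicted and an original spectral envelope. The envelopes here are crude smoothed magnitude spectra, and predicted_env is a toy stand-in for the paper's timbre-model output.

```python
# Frame-by-frame timbre correction via spectral-envelope differences.
import numpy as np
from scipy.signal import stft, istft
from scipy.ndimage import uniform_filter1d

def smooth_envelope(mag, width=15):
    # Crude spectral envelope: moving average along the frequency axis.
    return uniform_filter1d(mag, size=width, axis=0) + 1e-8

fs = 44_100
x = np.random.randn(fs)                      # placeholder for a violin sample
f, t, X = stft(x, fs=fs, nperseg=2048)

orig_env = smooth_envelope(np.abs(X))
predicted_env = orig_env * (1.0 + 0.5 * (f / f.max()))[:, None]  # toy target

X_out = X * (predicted_env / orig_env)       # per-frame correction filter
_, y = istft(X_out, fs=fs, nperseg=2048)
print(y.shape)
```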