Audio Processor Parameters: Estimating Distributions Instead of Deterministic Values
Audio effects and sound synthesizers are widely used processors in popular music. Their parameters control the quality of the output sound, and multiple parameter combinations can lead to the same sound. While recent approaches have been proposed to estimate these parameters given only the output sound, they are deterministic, i.e. they estimate a single solution among the many possible parameter configurations. In this work, we propose to model the parameters as probability distributions instead of deterministic values. To learn the distributions, we optimize two objectives: (1) we minimize the reconstruction error between the ground-truth output sound and the one generated using the estimated parameters, as is usually done, but also (2) we maximize the parameter diversity, using entropy. We evaluate our approach through two numerical audio experiments to show its effectiveness. These results show how our approach effectively outputs multiple combinations of parameters to match one sound.
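The two-objective training described above can be sketched as a toy loss over categorical distributions on quantized parameter values. This is a minimal illustration, not the paper's implementation; the function names and the entropy weight beta are assumptions.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy of a categorical distribution over parameter bins."""
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log(p + eps)))

def total_loss(recon_error, param_dists, beta=0.1):
    """Objective (1): reconstruction error, minus objective (2): an entropy
    bonus that rewards keeping several parameter configurations plausible.
    beta is an assumed trade-off weight, not taken from the paper."""
    return recon_error - beta * sum(entropy(p) for p in param_dists)

# With equal reconstruction quality, a spread-out distribution over one
# synthesizer parameter scores better (lower loss) than a peaked one.
spread = [np.full(4, 0.25)]
peaked = [np.array([0.97, 0.01, 0.01, 0.01])]
assert total_loss(1.0, spread) < total_loss(1.0, peaked)
```

Minimizing this loss pushes the model to reconstruct the target sound while keeping probability mass on all parameter settings that achieve it.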
A Statistics-Driven Differentiable Approach for Sound Texture Synthesis and Analysis
In this work, we introduce TexStat, a novel loss function specifically designed for the analysis and synthesis of texture sounds characterized by stochastic structure and perceptual stationarity. Drawing inspiration from the statistical and perceptual framework of McDermott and Simoncelli, TexStat identifies similarities between signals belonging to the same texture category without relying on temporal structure. We also propose using TexStat as a validation metric alongside the Fréchet Audio Distance (FAD) to evaluate texture sound synthesis models. In addition to TexStat, we present TexEnv, an efficient, lightweight and differentiable texture sound synthesizer that generates audio by imposing amplitude envelopes on filtered noise. We further integrate these components into TexDSP, a DDSP-inspired generative model tailored for texture sounds. Through extensive experiments across various texture sound types, we demonstrate that TexStat is perceptually meaningful, time-invariant, and robust to noise, features that make it effective both as a loss function for generative tasks and as a validation metric. All tools and code are provided as open-source contributions; our PyTorch implementations are efficient, differentiable, and highly configurable, enabling their use in both generative tasks and perceptually grounded evaluation.
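The idea of comparing textures through time-invariant statistics rather than waveforms can be illustrated with a crude stand-in: TexStat's actual statistics set (following McDermott and Simoncelli) is far richer, and the band split and statistics below are assumptions for the sketch.

```python
import numpy as np

def summary_stats(x, n_bands=4):
    """Crude stand-in for McDermott-Simoncelli-style summary statistics:
    mean, variance and skewness of band-limited envelope proxies.
    Illustrative only; not TexStat's statistics set."""
    spec = np.fft.rfft(x)
    stats = []
    for idx in np.array_split(np.arange(spec.size), n_bands):
        masked = np.zeros(spec.size, dtype=complex)
        masked[idx] = spec[idx]
        env = np.abs(np.fft.irfft(masked, n=x.size))  # band envelope proxy
        m, v = env.mean(), env.var()
        stats += [m, v, ((env - m) ** 3).mean() / (v ** 1.5 + 1e-12)]
    return np.array(stats)

def texture_distance(x, y):
    """Squared distance in statistics space: no temporal alignment needed."""
    return float(np.sum((summary_stats(x) - summary_stats(y)) ** 2))

# Shifting a texture in time barely changes its statistics,
# illustrating the time-invariance property claimed for TexStat.
rng = np.random.default_rng(0)
noise = rng.standard_normal(4096)
assert texture_distance(noise, np.roll(noise, 1000)) < 1e-6
```

Because the distance is computed on statistics, two independently generated clips of the same texture compare as similar even though their waveforms never align.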
Blind Arbitrary Reverb Matching
Reverb provides psychoacoustic cues that convey information about relative locations within an acoustical space. In audio production, the need often arises to impart on an audio track an acoustic context that resembles a reference track. One way to make audio tracks appear to have been recorded in the same space is to apply reverb to a dry track that is similar to the reverb in a wet one. This paper presents a model for the task of “reverb matching,” in which we attempt to automatically add artificial reverb to a track so that it sounds as if it was recorded in the same space as a reference track. We propose a model architecture for reverb matching and provide subjective experimental results suggesting that the model can perform as well as a human. We also provide open-source software for generating training data using an arbitrary Virtual Studio Technology (VST) plug-in.
Audio Effect Chain Estimation and Dry Signal Recovery From Multi-Effect-Processed Musical Signals
In this paper we propose a method that addresses a novel task: audio effect (AFX) chain estimation and dry signal recovery. AFXs are indispensable in modern sound design workflows, and sound engineers often cascade different AFXs (as an AFX chain) to achieve their desired soundscapes. Given a solo instrument performance processed with multiple AFXs (the wet signal), our method automatically estimates the applied AFX chain and recovers the unprocessed dry signal, whereas previous research addresses only one of the two. The estimated chain is useful for novice engineers learning practical uses of AFXs, and the recovered signal can be reused with a different AFX chain. To solve this task, we first develop a deep neural network model that estimates the last-applied AFX and undoes it, one effect at a time. We then apply this model iteratively to estimate the AFX chain and eventually recover the dry signal from the wet signal. Our experiments on guitar phrase recordings with various AFX chains demonstrate the validity of our method for both AFX-chain estimation and dry signal recovery. We also confirm that the input wet signal can be reproduced by applying the estimated AFX chain to the recovered dry signal.
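The iterative peel-off loop can be sketched with toy stand-ins. Here each "AFX" is just a gain, so undoing it is exact; in the paper both the classification of the last-applied AFX and its removal are performed by a deep network. The effect names, thresholds, and signals below are assumptions for illustration.

```python
import numpy as np

# Toy AFXs (plain gains) and their exact inverses.
APPLY = {"boost": lambda x: 2.0 * x, "cut": lambda x: 0.5 * x}
UNDO  = {"boost": lambda x: x / 2.0, "cut": lambda x: 2.0 * x}

def classify_last_afx(x):
    """Toy classifier: guess the last-applied effect from RMS level.
    None means the signal already looks dry (stands in for the paper's DNN)."""
    rms = np.sqrt(np.mean(x ** 2))
    if rms > 1.2:
        return "boost"
    if rms < 0.4:
        return "cut"
    return None

def estimate_chain_and_dry(wet, max_steps=8):
    """Iteratively peel off the last-applied AFX until the signal looks dry."""
    chain, x = [], wet
    for _ in range(max_steps):
        afx = classify_last_afx(x)
        if afx is None:
            break
        chain.append(afx)
        x = UNDO[afx](x)
    return chain[::-1], x  # chain in application order, recovered dry signal

dry = np.sin(np.linspace(0.0, 2.0 * np.pi, 256))
wet = APPLY["boost"](APPLY["boost"](dry))
chain, recovered = estimate_chain_and_dry(wet)
assert chain == ["boost", "boost"] and np.allclose(recovered, dry)
```

Reapplying the estimated chain to the recovered dry signal reproduces the wet input, mirroring the consistency check reported in the abstract.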
A Diffusion-Based Generative Equalizer for Music Restoration
This paper presents a novel approach to audio restoration, focusing on the enhancement of low-quality music recordings, and in particular historical ones. Building upon a previous algorithm called BABE, or Blind Audio Bandwidth Extension, we introduce BABE-2, which presents a series of improvements. This research broadens the concept of bandwidth extension to generative equalization, a task that, to the best of our knowledge, has not been previously addressed for music restoration. BABE-2 is built around an optimization algorithm utilizing priors from diffusion models, which are trained or fine-tuned using a curated set of high-quality music tracks. The algorithm simultaneously performs two critical tasks: estimation of the filter degradation magnitude response and hallucination of the restored audio. The proposed method is objectively evaluated on historical piano recordings, showing an enhancement over the prior version. The method yields similarly impressive results in rejuvenating the works of renowned vocalists Enrico Caruso and Nellie Melba. This research represents an advancement in the practical restoration of historical music. Historical music restoration examples are available at: research.spa.aalto.fi/publications/papers/dafx-babe2/.
A method for spectrum separation and envelope estimation of the residual in spectrum modeling of musical sound
We propose an original technique for separating the spectrum of the noisy residual component from that of the harmonic, quasi-deterministic one, and for estimating the envelope of the residual, for use in the spectrum modeling of musical sounds. The spectrum-separation algorithm relies on nonlinear transformations of the amplitude spectrum of the sampled signal (obtained via the FFT), which make it possible to eliminate the dominant partials without the need for precisely tuned notch filters. The envelope estimation is performed by calculating the energy of the signal in the frequency domain over a sliding time window. Finally, the residual can be obtained by combining its spectrum and envelope, so that separate processing can be performed on the two.
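A minimal sketch of the two stages, assuming a running median across frequency as the nonlinear transformation (the paper's exact transformations differ): a median strongly suppresses narrow dominant partials while leaving the broadband residual floor intact, with no notch filters.

```python
import numpy as np

def residual_spectrum(x, kernel=9):
    """Residual amplitude spectrum via a nonlinear (median) transformation
    of the FFT magnitude. Illustrative choice, not the paper's exact one."""
    mag = np.abs(np.fft.rfft(x))
    padded = np.pad(mag, kernel // 2, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, kernel)
    return np.median(windows, axis=-1)

def sliding_energy(x, win=256, hop=128):
    """Residual envelope: energy computed in the frequency domain
    over a sliding time window."""
    starts = range(0, len(x) - win + 1, hop)
    return np.array([np.sum(np.abs(np.fft.rfft(x[s:s + win])) ** 2)
                     for s in starts])

# A pure partial is almost entirely removed from the residual spectrum:
# the median of its FFT-bin neighborhood is near zero.
t = np.arange(2048)
sine = np.sin(2 * np.pi * 200 * t / 2048)
assert residual_spectrum(sine)[200] < 0.01 * np.abs(np.fft.rfft(sine))[200]
```

Combining the residual spectrum with the sliding-window energy envelope then yields the separated residual, which can be processed independently of the harmonic part.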
Pinna Morphological Parameters Influencing HRTF Sets
Head-Related Transfer Functions (HRTFs) are one of the main components of binaural rendering. By definition, these functions express the deep link between hearing and morphology, especially that of the torso, head and ears. Although the perceptual effect of HRTFs is undeniable, the exact influence of human morphology is still unclear. Its reduction to a few anthropometric measurements has led to numerous studies aiming to establish a ranking of these parameters; however, no consensus has yet been reached. In this paper, we study the influence of the anthropometric measurements of the ear, as defined by the CIPIC database, on the HRTFs. This is done by computing HRTFs with the Fast Multipole Boundary Element Method (FM-BEM) from a parametric model of the torso, head and ears. Their variations are measured with four different spectral metrics over four frequency bands spanning 0 to 16 kHz. Our contribution is the establishment of a ranking of the selected parameters and a comparison with what has already been obtained by the community. Additionally, we discuss the relevance of each approach, especially when it relies on the CIPIC data, as well as the limitations of the CIPIC database.
Categorisation of Distortion Profiles in Relation to Audio Quality
Since digital audio is encoded as discrete samples of the audio waveform, much can be said about a recording from the statistical properties of those samples. In this paper, a dataset of CD audio samples is analysed; the probability mass function of each audio clip informs a feature set that describes attributes of the musical recording related to loudness, dynamics and distortion. This allows musical recordings to be classified according to their “distortion character”, a concept that describes the nature of amplitude distortion in mastered audio. A subjective test was designed in which such recordings were rated according to the perception of their audio quality. It is shown that participants can discern between three different distortion characters; ratings of audio quality were significantly different (F(1, 2) = 5.72, p < 0.001, η² = 0.008), as were the words used to describe the attributes on which quality was assessed (χ²(8, N = 547) = 33.28, p < 0.001). This expands upon previous work showing links between the effects of dynamic range compression and audio quality in musical recordings by highlighting perceptual differences.
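The idea of deriving loudness, dynamics, and distortion descriptors from the sample-value probability mass function can be sketched as follows. The specific features below (RMS, crest factor, PMF mass at the extreme levels) are illustrative assumptions, not the paper's exact feature set.

```python
import numpy as np

def pmf_features(x, bits=8):
    """Quantise a clip, form the sample-value probability mass function (PMF),
    and derive simple loudness/dynamics/distortion descriptors from it.
    Illustrative feature set; the paper's descriptors differ in detail."""
    half = 2 ** (bits - 1) - 1                      # e.g. 127 for 8 bits
    q = np.clip(np.round(x * half), -half, half)
    vals, counts = np.unique(q, return_counts=True)
    pmf = counts / counts.sum()
    rms = np.sqrt(np.mean(x ** 2))
    return {
        "rms": rms,                                          # loudness proxy
        "crest_factor": np.max(np.abs(x)) / (rms + 1e-12),   # dynamics
        # PMF mass piled up at the extreme levels hints at amplitude distortion
        "clip_mass": pmf[vals == half].sum() + pmf[vals == -half].sum(),
    }

t = np.linspace(0.0, 1.0, 8000, endpoint=False)
clean = np.sin(2 * np.pi * 100 * t)
clipped = np.clip(1.5 * clean, -1.0, 1.0)   # hard-clipped "hot master"
f_clean, f_clip = pmf_features(clean), pmf_features(clipped)
assert f_clip["clip_mass"] > f_clean["clip_mass"]
assert f_clip["crest_factor"] < f_clean["crest_factor"]
```

The clipped clip piles probability mass at full scale and loses crest factor, the kind of signature that lets recordings be grouped by distortion character.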
Relative Music Loudness Estimation Using Temporal Convolutional Networks and a CNN Feature Extraction Front-End
Relative music loudness estimation is a MIR task that consists of dividing audio into segments of three classes: Foreground Music, Background Music and No Music. Given the temporal correlation of music, in this work we approach the task using a type of network able to model temporal context: the Temporal Convolutional Network (TCN). We propose two architectures: a TCN, and a novel architecture combining a TCN with a Convolutional Neural Network (CNN) front-end, which we name the CNN-TCN. We expect the CNN front-end to act as a feature-extraction stage that makes more efficient use of the network’s parameters. We use the OpenBMAT dataset to train and test 40 TCN and 80 CNN-TCN models with two grid searches over a set of hyper-parameters. We compare our models with the two best algorithms submitted to the MIREX 2019 tasks of music detection and relative music loudness estimation. All our models outperform the MIREX algorithms, even with fewer parameters. The CNN-TCN emerges as the best architecture, as all its models outperform all TCN models. We show that adding a CNN front-end to a TCN can actually reduce the number of parameters of the network while improving performance. The CNN front-end effectively works as a feature extractor, producing consistent patterns that identify different combinations of music and non-music sounds, and also helps produce a smoother output than the TCN models.
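The reason a TCN can model long temporal context is the standard receptive-field arithmetic of stacked dilated causal convolutions. The sketch below assumes one dilated convolution per block with dilation doubling per block; the paper's actual hyper-parameters come from its grid searches.

```python
def tcn_receptive_field(kernel_size, n_blocks, dilation_base=2):
    """Frames of temporal context seen by a stack of dilated causal
    convolutions: each block b adds (kernel_size - 1) * dilation_base**b
    frames. Standard TCN arithmetic; hyper-parameters here are assumed."""
    rf = 1
    for b in range(n_blocks):
        rf += (kernel_size - 1) * dilation_base ** b
    return rf

# Context grows exponentially with depth: 8 blocks of 3-tap kernels
# already cover 511 input frames.
assert tcn_receptive_field(3, 8) == 511
```

A CNN front-end that downsamples in time multiplies this coverage further, which is consistent with the CNN-TCN matching or beating plain TCNs with fewer parameters.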
Model Bending: Teaching Circuit Models New Tricks
A technique called model bending is introduced for generating novel signal processing systems grounded in analog electronic circuits. By applying the ideas behind circuit bending to models of nonlinear analog circuits, it is possible to create novel nonlinear signal processors that mimic the behavior of analog electronics but are not possible to implement in the analog realm. The history of both circuit bending and circuit modeling is discussed, as well as a theoretical basis for how the two approaches can complement each other. Potential pitfalls in the practical application of model bending are highlighted, and suggested solutions to those problems are provided, with examples.
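A toy illustration of the bending idea, assuming the simplest possible circuit model (the paper bends richer nonlinear circuit models): a one-pole lowpass derived from an RC circuit has coefficient a = exp(-1/(fs·R·C)), which any physical R and C constrain to 0 < a < 1. "Bending" means setting the coefficient outside that physical range, e.g. negative, which stays stable digitally but has no analog counterpart.

```python
import numpy as np

def rc_lowpass(x, a):
    """One-pole filter y[n] = a*y[n-1] + (1-a)*x[n], modeled on an RC
    circuit. Physical RC values force 0 < a < 1; other values of a are
    "bent" settings unreachable in the analog realm."""
    y = np.zeros(len(x), dtype=float)
    prev = 0.0
    for n, xn in enumerate(x):
        prev = a * prev + (1.0 - a) * xn
        y[n] = prev
    return y

x = np.ones(64)                     # unit step input
physical = rc_lowpass(x, a=0.9)     # smooth exponential rise, as in analog
bent = rc_lowpass(x, a=-0.9)        # ringing/alternating response: no real RC does this
assert np.all(np.diff(physical) >= 0)   # monotone analog-style step response
assert np.any(np.diff(bent) < 0)        # bent model oscillates
```

The bent filter remains a stable, usable processor (|a| < 1), yet its behavior cannot be realized by the circuit that inspired the model, which is exactly the creative territory model bending explores.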