Automatic Classification of Chains of Guitar Effects Through Evolutionary Neural Architecture Search
Recent studies on classifying electric guitar effects have achieved high accuracy, particularly with deep learning techniques. However, these studies often rely on simplified datasets consisting mainly of single notes rather than realistic guitar recordings. Moreover, in the specific field of effect chain estimation, the literature tends to rely on large models, making them impractical for real-time or resource-constrained applications. In this work, we recorded realistic guitar performances using four different guitars and created three datasets by applying a chain of five effects with increasing complexity: (1) fixed order and parameters, (2) fixed order with randomly sampled parameters, and (3) random order and parameters. We also propose a novel Neural Architecture Search method aimed at discovering accurate yet compact convolutional neural network models to reduce power and memory consumption. We compared its performance to a basic random search strategy, showing that our custom Neural Architecture Search outperformed random search in identifying models that balance accuracy and complexity. We found that the number of convolutional and pooling layers becomes increasingly important as dataset complexity grows, while dense layers have less impact. Additionally, among the effects, tremolo was identified as the most challenging to classify.
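As a point of reference for the random-search baseline discussed above, the sketch below samples compact CNN configurations and keeps the smallest candidate under a parameter budget. The search space, layer ranges, and analytic parameter count are illustrative assumptions, not the paper's NAS method; in practice each candidate would also be trained and scored on classification accuracy.

```python
import random

# Hypothetical search space for compact CNNs (illustrative only; the actual
# NAS operators and ranges of the paper are not reproduced here).
SPACE = {
    "n_conv": [1, 2, 3, 4],          # number of conv + pooling blocks
    "filters": [8, 16, 32, 64],      # filters per conv layer
    "kernel": [3, 5],                # square kernel size
    "n_dense": [0, 1, 2],            # hidden dense layers
    "dense_units": [32, 64, 128],
}

def sample_architecture():
    """Draw one random candidate from the search space."""
    return {key: random.choice(values) for key, values in SPACE.items()}

def count_parameters(arch, in_channels=1, n_classes=5):
    """Rough analytic parameter count for a plain conv -> pool -> dense stack."""
    params, ch = 0, in_channels
    for _ in range(arch["n_conv"]):
        params += ch * arch["filters"] * arch["kernel"] ** 2 + arch["filters"]
        ch = arch["filters"]
    feat = ch  # assume global average pooling before the dense head
    for _ in range(arch["n_dense"]):
        params += feat * arch["dense_units"] + arch["dense_units"]
        feat = arch["dense_units"]
    return params + feat * n_classes + n_classes

def random_search(n_trials=100, budget=50_000):
    """Keep the smallest candidate under a parameter budget. In a real search
    each candidate would also be trained and scored on validation accuracy."""
    best = None
    for _ in range(n_trials):
        arch = sample_architecture()
        size = count_parameters(arch)
        if size <= budget and (best is None or size < best[1]):
            best = (arch, size)
    return best

if __name__ == "__main__":
    print(random_search())
```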
Neural-Driven Multi-Band Processing for Automatic Equalization and Style Transfer
We present a Neural-Driven Multi-Band Processor (NDMP), a differentiable audio processing framework that augments a static six-band Parametric Equalizer (PEQ) with per-band dynamic range compression. We optimize this processor using neural inference for two tasks: Automatic Equalization (AutoEQ), which estimates tonal and dynamic corrections without a reference, and Production Style Transfer (NDMP-ST), which adapts the processing of an input signal to match the tonal and dynamic characteristics of a reference. We train NDMP using a self-supervised strategy, where the model learns to recover a clean signal from inputs degraded with randomly sampled NDMP parameters and gain adjustments. This setup eliminates the need for paired input–target data and enables end-to-end training with audio-domain loss functions. At inference, AutoEQ enhances previously unseen inputs in a blind setting, while NDMP-ST performs style transfer by predicting task-specific processing parameters. We evaluate our approach on the MUSDB18 dataset using both objective metrics (e.g., SI-SDR, PESQ, STFT loss) and a listening test. Our results show that NDMP consistently outperforms traditional PEQ and a PEQ+DRC (single-band) baseline, offering a robust neural framework for audio enhancement that combines learned spectral and dynamic control.
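The self-supervised strategy described above can be illustrated with a minimal data-generation sketch: a clean excerpt is degraded with randomly sampled EQ gains and an overall level offset, and the (degraded, clean) pair serves as training input and target. The RBJ peaking-EQ biquads, band centres, and parameter ranges below are assumptions for illustration, and the sketch omits the per-band compression of the actual NDMP processor.

```python
import numpy as np
from scipy.signal import lfilter

def peaking_eq(f0, gain_db, q, fs):
    """RBJ-cookbook peaking-EQ biquad coefficients (normalized by a0)."""
    A = 10 ** (gain_db / 40.0)
    w0 = 2 * np.pi * f0 / fs
    alpha = np.sin(w0) / (2 * q)
    b = np.array([1 + alpha * A, -2 * np.cos(w0), 1 - alpha * A])
    a = np.array([1 + alpha / A, -2 * np.cos(w0), 1 - alpha / A])
    return b / a[0], a / a[0]

def degrade(clean, fs, n_bands=6, rng=np.random.default_rng(0)):
    """Apply randomly sampled per-band EQ gains and an overall gain offset.
    The resulting (degraded, clean) pair trains a model to undo the change."""
    x = clean.copy()
    centres = np.geomspace(80.0, 0.4 * fs, n_bands)    # hypothetical band centres
    for f0 in centres:
        b, a = peaking_eq(f0, gain_db=rng.uniform(-12, 12), q=1.0, fs=fs)
        x = lfilter(b, a, x)
    x *= 10 ** (rng.uniform(-6, 6) / 20.0)              # random overall gain (dB)
    return x, clean

if __name__ == "__main__":
    fs = 44100
    clean = np.random.default_rng(1).standard_normal(fs)  # stand-in for a music excerpt
    degraded, target = degrade(clean, fs)
    print(degraded.shape, target.shape)
```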
Real-Time Implementation of the Dynamic Stiff String Using Finite-Difference Time-Domain Methods and the Dynamic Grid
Digital musical instruments based on physical modelling have gained increased popularity over the past years. This is partly due to recent advances in computational power, which allow for their real-time implementation. One of the great potentials of digital musical instruments based on physical models is that one can go beyond what is physically possible and change properties of the instruments which are static in real life. This paper presents a real-time implementation of the dynamic stiff string using finite-difference time-domain (FDTD) methods. The defining parameters of the string can be varied in real time, changing the underlying grid that these methods rely on, based on the recently developed dynamic grid method. For most settings, parameter changes are nearly instantaneous and do not cause noticeable artefacts due to changes in the grid. A reliable way to prevent artefacts for all settings is under development.
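For readers unfamiliar with the underlying scheme, the following is a minimal static-grid FDTD sketch of the undamped, simply supported stiff string, not the dynamic-grid method of the paper. Parameter values are arbitrary demo choices; the grid spacing is set from the usual stability condition.

```python
import numpy as np

# Static-grid FDTD scheme for the undamped stiff string
#   u_tt = c^2 u_xx - kappa^2 u_xxxx
# with simply supported ends (a textbook sketch, not the dynamic grid).
fs = 44100
k = 1.0 / fs                # time step
c = 200.0                   # wave speed, arbitrary demo value
kappa = 1.0                 # stiffness coefficient, arbitrary demo value

# Smallest stable grid spacing from the condition lambda^2 + 4*mu^2 <= 1
h = np.sqrt((c**2 * k**2 + np.sqrt(c**4 * k**4 + 16 * kappa**2 * k**2)) / 2)
N = int(np.floor(1.0 / h))  # grid intervals on a unit-length string
h = 1.0 / N
lam2 = (c * k / h) ** 2
mu2 = (kappa * k / h ** 2) ** 2

u = np.zeros(N + 1)
u[N // 2] = 1e-3            # crude "pluck": displaced midpoint, zero velocity
u_prev = u.copy()

out = np.zeros(fs // 10)    # 100 ms of output
for n in range(len(out)):
    # virtual points for simply supported ends: u(-1) = -u(1), u(N+1) = -u(N-1)
    ue = np.concatenate(([-u[1]], u, [-u[-2]]))
    dxx = ue[2:] - 2 * ue[1:-1] + ue[:-2]                  # second difference, points 0..N
    dxxxx = np.zeros_like(u)
    dxxxx[1:-1] = dxx[2:] - 2 * dxx[1:-1] + dxx[:-2]       # fourth difference, interior
    u_next = np.zeros_like(u)
    u_next[1:-1] = (2 * u[1:-1] - u_prev[1:-1]
                    + lam2 * dxx[1:-1] - mu2 * dxxxx[1:-1])
    out[n] = u_next[int(0.3 * N)]                          # "pickup" position
    u_prev, u = u, u_next
print(N, lam2 + 4 * mu2)    # grid size and stability measure (should be <= 1)
```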
Physical Modeling Using Recurrent Neural Networks with Fast Convolutional Layers
Discrete-time modeling of acoustic, mechanical and electrical systems is a prominent topic in the musical signal processing literature. Such models are mostly derived by discretizing a mathematical model, given in terms of ordinary or partial differential equations, using established techniques. Recent work has applied machine-learning techniques to construct such models automatically from data for the case of systems which have lumped states described by scalar values, such as electrical circuits. In this work, we examine how similar techniques are able to construct models of systems which have spatially distributed rather than lumped states. We describe several novel recurrent neural network structures, and show how they can be thought of as an extension of modal techniques. As a proof of concept, we generate synthetic data for three physical systems and show that the proposed network structures can be trained with this data to reproduce the behavior of these systems.
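The connection to modal techniques mentioned above can be made concrete with a small sketch: a modal model is a sum of exponentially decaying sinusoids, and implementing each mode as a one-pole complex recursion yields a diagonal linear recurrent system, the baseline that learned recurrent structures extend. Frequencies, decays, and amplitudes below are illustrative, not taken from the paper.

```python
import numpy as np

# A modal model as a diagonal linear recurrence: each mode is a one-pole
# complex recursion (an exponentially decaying sinusoid). Frequencies,
# decays and amplitudes are illustrative, not taken from the paper.
fs = 44100
freqs = np.array([220.0, 440.0, 661.0, 884.0])   # modal frequencies (Hz)
decays = np.array([1.5, 2.5, 4.0, 6.0])          # decay rates (1/s)
amps = np.array([1.0, 0.5, 0.3, 0.2])            # excitation of each mode

poles = np.exp((-decays + 2j * np.pi * freqs) / fs)   # one pole per mode

state = amps.astype(complex)         # initial modal excitation ("strike")
out = np.zeros(fs)                   # one second of output
for i in range(len(out)):
    out[i] = state.real.sum()        # output taps the sum of modal states
    state *= poles                   # diagonal recurrence: x[n+1] = A x[n]
print(out[:4])
```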
Differentiable Piano Model for MIDI-to-Audio Performance Synthesis
Recent neural-based synthesis models have achieved impressive results for musical instrument sound generation. In particular, the Differentiable Digital Signal Processing (DDSP) framework enables the usage of spectral modeling analysis and synthesis techniques in fully differentiable architectures. Yet currently, it has only been used for modeling monophonic instruments. Leveraging the interpretability and modularity of this framework, the present work introduces a polyphonic differentiable model for piano sound synthesis, conditioned on Musical Instrument Digital Interface (MIDI) inputs. The model architecture is motivated by high-level acoustic modeling knowledge of the instrument which, in tandem with the sound structure priors inherent to the DDSP components, makes for a lightweight, interpretable and realistic-sounding piano model. The proposed model has been evaluated in a listening test, demonstrating improved sound quality compared to a benchmark neural-based piano model, with significantly fewer parameters and even with reduced training data. The same listening test indicates that physical-modeling-based models still achieve better quality, but the differentiability of our lightened approach encourages its usage in other musical tasks dealing with polyphonic audio and symbolic data.
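As a rough illustration of the spectral-modeling components such a model builds on, the sketch below renders a single inharmonic, piano-like partial series plus a small noise floor. In the actual model the per-partial envelopes and noise are predicted by a network from MIDI conditioning; here they are hand-set, and the inharmonicity coefficient is an arbitrary assumption.

```python
import numpy as np

# Single inharmonic, piano-like partial series plus a small noise floor.
# The stretched partial frequencies follow f_k = k*f0*sqrt(1 + B*k^2); the
# inharmonicity B, envelopes, and noise level are illustrative assumptions.
fs, dur = 44100, 2.0
t = np.arange(int(fs * dur)) / fs
f0, B = 220.0, 1e-4

out = np.zeros_like(t)
for k in range(1, 21):
    fk = k * f0 * np.sqrt(1 + B * k ** 2)      # stretched partial frequency
    if fk >= fs / 2:
        break
    out += (np.exp(-3.0 * t) / k) * np.sin(2 * np.pi * fk * t)    # decaying partial
out += 1e-3 * np.random.default_rng(0).standard_normal(len(t))    # residual noise
out /= np.max(np.abs(out))
```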
Differentiable grey-box modelling of phaser effects using frame-based spectral processing
Machine learning approaches to modelling analog audio effects have seen intensive investigation in recent years, particularly in the context of non-linear time-invariant effects such as guitar amplifiers. For modulation effects such as phasers, however, new challenges emerge due to the presence of the low-frequency oscillator which controls the slowly time-varying nature of the effect. Existing approaches have either required foreknowledge of this control signal, or have been non-causal in implementation. This work presents a differentiable digital signal processing approach to modelling phaser effects in which the underlying control signal and time-varying spectral response of the effect are jointly learned. The proposed model processes audio in short frames to implement a time-varying filter in the frequency domain, with a transfer function based on typical analog phaser circuit topology. We show that the model can be trained to emulate an analog reference device, while retaining interpretable and adjustable parameters. The frame duration is an important hyper-parameter of the proposed model, so an investigation was carried out into its effect on model accuracy. The optimal frame length depends on both the rate and transient decay-time of the target effect, but the frame length can be altered at inference time without a significant change in accuracy.
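The frame-based spectral processing idea can be sketched as follows: overlapping frames are filtered in the frequency domain with a cascade of first-order analog-prototype allpass sections whose break frequency follows a low-frequency oscillator, then overlap-added. The LFO, stage count, and mix below are illustrative assumptions rather than the learned parameters of the paper, and the circular-convolution effects of per-frame spectral multiplication are ignored for brevity.

```python
import numpy as np

fs, frame, hop = 44100, 1024, 256
window = np.hanning(frame)
x = np.random.default_rng(0).standard_normal(fs)   # stand-in for guitar audio
y = np.zeros(len(x) + frame)

f_lfo, centre, depth = 0.5, 1000.0, 800.0          # LFO rate (Hz), sweep centre/width (Hz)
n_stages, mix = 4, 0.5                             # allpass stages, dry/wet mix

freqs = np.fft.rfftfreq(frame, 1 / fs)
w = 2 * np.pi * freqs                              # analog frequency axis (rad/s)

for start in range(0, len(x) - frame, hop):
    fb = centre + depth * np.sin(2 * np.pi * f_lfo * start / fs)   # LFO'd break frequency
    wb = 2 * np.pi * fb
    h_ap = ((wb - 1j * w) / (wb + 1j * w)) ** n_stages             # allpass cascade
    H = (1 - mix) + mix * h_ap                                     # dry/wet sum -> notches
    X = np.fft.rfft(window * x[start:start + frame])
    y[start:start + frame] += window * np.fft.irfft(H * X, frame)
y /= 1.5                                           # Hann^2 overlap-add gain at 75% overlap
```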
Upcycling Android Phones into Embedded Audio Platforms
There are millions of sophisticated Android phones in the world that get disposed of at a very high rate due to consumerism. Their computational power and built-in features, instead of being wasted when discarded, could be repurposed for creative applications such as musical instruments and interactive audio installations. However, audio programming on Android is complicated and comes with restrictions that heavily impact performance. To address this issue, we present LDSP, an open-source environment that can be used to easily upcycle Android phones into embedded platforms optimized for audio synthesis and processing. We conducted a benchmark study to compare the number of oscillators that can be run in parallel on LDSP with an equivalent audio app designed according to modern Android standards. Our study tested six phones ranging from 2014 to 2018 and running different Android versions. The results consistently demonstrate that LDSP provides a significant boost in performance, in some cases more than doubling the number of oscillators, making even very old phones suitable for fairly advanced audio applications.
Antiderivative Antialiasing for Recurrent Neural Networks
Neural networks have become invaluable for general audio processing tasks, such as virtual analog modeling of nonlinear audio equipment. For sequence modeling tasks in particular, recurrent neural networks (RNNs) have gained widespread adoption in recent years. Their general applicability and effectiveness stem partly from their inherent nonlinearity, which also makes them prone to aliasing. Recent work has explored mitigating aliasing by oversampling the network, an approach whose effectiveness is directly linked with the incurred computational costs. This work explores an alternative route by extending the antiderivative antialiasing technique to explicit, computable RNNs. Detailed applications to the Gated Recurrent Unit and Long Short-Term Memory cell are shown as case studies. The proposed technique is evaluated on multiple pre-trained guitar amplifier models, assessing its impact on the amount of aliasing and model tonality. The method is shown to reduce the models' tendency to alias considerably across all considered sample rates while only affecting their tonality moderately, without requiring high oversampling factors. The results of this study can be used to improve sound quality in neural audio processing tasks that employ a suitable class of RNNs. Additional materials are provided on the accompanying webpage.
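For background, the classical form of antiderivative antialiasing applies to a memoryless nonlinearity such as tanh; the paper extends this idea to the nonlinearities inside recurrent cells. The sketch below shows the first-order scheme on a driven sine; the drive level and test frequency are illustrative, and the method introduces roughly half a sample of delay.

```python
import numpy as np

def tanh_adaa(x, eps=1e-6):
    """First-order ADAA for y = tanh(x); F1 is the antiderivative log(cosh(x))."""
    F1 = lambda v: np.logaddexp(v, -v) - np.log(2.0)
    y = np.empty_like(x)
    x_prev = 0.0
    for n, xn in enumerate(x):
        d = xn - x_prev
        if abs(d) > eps:
            y[n] = (F1(xn) - F1(x_prev)) / d            # averaged slope of F1
        else:
            y[n] = np.tanh(0.5 * (xn + x_prev))         # ill-conditioned fallback
        x_prev = xn
    return y

fs = 44100
t = np.arange(fs) / fs
x = 10.0 * np.sin(2 * np.pi * 1244.5 * t)               # heavily driven test sine
y = tanh_adaa(x)                                        # aliasing-reduced output
```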
Unsupervised Estimation of Nonlinear Audio Effects: Comparing Diffusion-Based and Adversarial Approaches
Accurately estimating nonlinear audio effects without access to paired input-output signals remains a challenging problem. This work studies unsupervised probabilistic approaches for solving this task. We introduce a method, novel for this application, based on diffusion generative models for blind system identification, enabling the estimation of unknown nonlinear effects using black- and gray-box models. This study compares this method with a previously proposed adversarial approach, analyzing the performance of both methods under different parameterizations of the effect operator and varying lengths of available effected recordings. Through experiments on guitar distortion effects, we show that the diffusion-based approach provides more stable results and is less sensitive to data availability, while the adversarial approach is superior at estimating more pronounced distortion effects. Our findings contribute to the robust unsupervised blind estimation of audio effects, demonstrating the potential of diffusion models for system identification in music technology.
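For context, a toy gray-box distortion operator of the kind whose parameters a blind estimation method would need to recover might consist of a pre-gain, a static tanh nonlinearity, and a one-pole tone filter. This structure and its parameter ranges are assumptions for illustration, not the effect operators used in the paper.

```python
import numpy as np

def distortion(x, pre_gain_db=20.0, tone_cutoff_hz=4000.0, fs=44100):
    """Toy gray-box effect: pre-gain -> static tanh -> one-pole tone filter."""
    g = 10 ** (pre_gain_db / 20.0)
    clipped = np.tanh(g * x)                  # static nonlinearity
    a = np.exp(-2 * np.pi * tone_cutoff_hz / fs)
    out = np.zeros_like(clipped)
    state = 0.0
    for n, v in enumerate(clipped):           # one-pole lowpass "tone" control
        state = (1 - a) * v + a * state
        out[n] = state
    return out

x = 0.5 * np.sin(2 * np.pi * 220 * np.arange(44100) / 44100)
y = distortion(x)    # the parameters above are what blind estimation would recover
```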
Continuous State Modeling for Statistical Spectral Synthesis
Continuous State Markovian Spectral Modeling is a novel approach for parametric synthesis of spectral modeling parameters, based on the sines plus noise paradigm. The method aims specifically at capturing shimmer and jitter: micro-fluctuations in the partials' frequency and amplitude trajectories, which are essential for the timbre of musical instruments. It allows for parametric control over the timbral qualities, while removing the need for the more computationally expensive and restrictive discrete state space modeling method. A qualitative comparison between an original violin sound and a re-synthesis shows the ability of the algorithm to reproduce the micro-fluctuations, considering their stochastic and spectral properties.
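The shimmer and jitter described above can be sketched by superimposing small first-order Markov (AR(1)) fluctuations on the frequency and amplitude trajectories of each partial of a sines-plus-noise model. The coefficients below are illustrative and not estimated from a real violin recording as in the paper.

```python
import numpy as np

fs, dur = 44100, 2.0
n = int(fs * dur)
rng = np.random.default_rng(0)

def ar1(length, coeff, sigma):
    """First-order Markov process driving the micro-fluctuations."""
    x = np.zeros(length)
    drive = sigma * rng.standard_normal(length)
    for i in range(1, length):
        x[i] = coeff * x[i - 1] + drive[i]
    return x

f0, out = 440.0, np.zeros(n)
for k in range(1, 9):                           # first eight partials
    jitter = ar1(n, 0.999, 0.02)                # slow relative frequency deviation
    shimmer = ar1(n, 0.999, 0.002)              # slow relative amplitude deviation
    freq = k * f0 * (1.0 + 0.001 * jitter)
    phase = 2 * np.pi * np.cumsum(freq) / fs
    out += (1.0 / k) * (1.0 + shimmer) * np.sin(phase)
out += 0.01 * rng.standard_normal(n)            # broadband noise component
out /= np.max(np.abs(out))
```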