Solid State Bus-Comp: A Large-Scale and Diverse Dataset for Dynamic Range Compressor Virtual Analog Modeling

Virtual Analog (VA) modeling aims to simulate the behavior
of hardware circuits via algorithms to replicate their tone digitally.
Dynamic Range Compressor (DRC) is an audio processing module
that controls the dynamics of a track by attenuating loud sounds and amplifying quiet ones, which is essential in music
production. In recent years, neural-network-based VA modeling has
shown great potential in producing high-fidelity models. However,
due to limited data quantity and diversity, their ability to generalize across different parameter settings and input sounds is still limited. To tackle this problem, we present Solid State Bus-Comp, the
first large-scale and diverse dataset for modeling the classical VCA
compressor — SSL 500 G-Bus. Specifically, we manually collected
175 unmastered songs from the Cambridge Multitrack Library. We
recorded the compressed audio under 220 parameter combinations,
resulting in an extensive 2528-hour dataset with diverse genres, instruments, tempos, and keys. Moreover, to facilitate the use of our
proposed dataset, we conducted benchmark experiments on various open-source black-box and grey-box models, as well as white-box plugins. We also conducted ablation studies on different data subsets to illustrate the effectiveness of the improved data diversity and
quantity. The dataset and demos are on our project page: https://www.yichenggu.com/SolidStateBusComp/.

Hyperbolic Embeddings for Order-Aware Classification of Audio Effect Chains

Audio effects (AFXs) are essential tools in music production, frequently applied in chains to shape timbre and dynamics. The order of AFXs in a chain plays a crucial role in determining the final sound, particularly when non-linear (e.g., distortion) or time-variant (e.g., chorus) processors are involved. Despite its importance, most AFX-related studies have primarily focused on estimating effect types and their parameters from a wet signal. To
address this gap, we formulate AFX chain recognition as the task
of jointly estimating AFX types and their order from a wet signal.
We propose a neural-network-based method that embeds wet signals into a hyperbolic space and classifies their AFX chains. Hyperbolic space can represent tree-structured data more efficiently
than Euclidean space due to its exponential expansion property.
Since AFX chains can be represented as trees, with AFXs as nodes
and edges encoding effect order, hyperbolic space is well-suited
for modeling the exponentially growing and non-commutative nature of ordered AFX combinations, where changes in effect order can result in different final sounds. Experiments using guitar
sounds demonstrate that, with an appropriate curvature, the proposed method outperforms its Euclidean counterpart. Further analysis based on AFX type and chain length highlights the effectiveness of the proposed method in capturing AFX order.
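To make the geometric idea concrete, below is a minimal sketch (not the authors' implementation) of the Poincare-ball distance with an adjustable curvature parameter c; a chain classifier operating in hyperbolic space would compare such distances between wet-signal embeddings and, for example, class prototypes. All function names and the curvature value are illustrative assumptions.

import numpy as np

def mobius_add(x, y, c):
    # Moebius addition on the Poincare ball with curvature parameter c > 0.
    xy = np.dot(x, y)
    x2 = np.dot(x, x)
    y2 = np.dot(y, y)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    den = 1 + 2 * c * xy + c ** 2 * x2 * y2
    return num / den

def poincare_distance(x, y, c=1.0):
    # Geodesic distance between two points inside the Poincare ball.
    diff = mobius_add(-x, y, c)
    norm = np.linalg.norm(diff)
    # Clip keeps arctanh finite when points approach the ball boundary.
    return (2.0 / np.sqrt(c)) * np.arctanh(np.clip(np.sqrt(c) * norm, 0.0, 1 - 1e-7))

# Embeddings of two wet signals (random here, just to exercise the math).
a = np.array([0.1, -0.2])
b = np.array([0.3, 0.4])
print(poincare_distance(a, b, c=0.5))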

Improving Lyrics-to-Audio Alignment Using Frame-wise Phoneme Labels with Masked Cross Entropy Loss

This paper addresses the task of lyrics-to-audio alignment, which
involves synchronizing textual lyrics with corresponding music
audio. Most publicly available datasets for this task provide annotations only at the line or word level. This poses a challenge
for training lyrics-to-audio models due to the lack of frame-wise
phoneme labels. However, we find that phoneme labels can be
partially derived from word-level annotations: for single-phoneme
words, all frames corresponding to the word can be labeled with
the same phoneme; for multi-phoneme words, phoneme labels can
be assigned at the first and last frames of the word. To leverage
this partial information, we construct a mask for those frames and
propose a masked frame-wise cross-entropy (CE) loss that considers only frames with known phoneme labels. As a baseline model,
we adopt an autoencoder trained with a Connectionist Temporal
Classification (CTC) loss and a reconstruction loss. We then enhance the training process by incorporating the proposed frame-wise masked CE loss. Experimental results show that incorporating the frame-wise masked CE loss improves alignment performance. In comparison to other state-of-the-art models, our model achieves a comparable Mean Absolute Error (MAE) of 0.216 seconds and the best Median Absolute Error (MedAE) of 0.041 seconds on the Jamendo test set.
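The partial labeling rule described above translates almost directly into code. Below is a minimal sketch, assuming PyTorch and word-level annotations given as (start_frame, end_frame, phoneme list) triples; the helper names are ours, not the paper's.

import torch
import torch.nn.functional as F

def build_partial_labels(words, num_frames, blank=0):
    # words: list of (start_frame, end_frame, [phoneme indices]) from word-level
    # annotations. Returns frame labels and a mask marking trusted frames.
    labels = torch.full((num_frames,), blank, dtype=torch.long)
    mask = torch.zeros(num_frames, dtype=torch.bool)
    for start, end, phones in words:
        if len(phones) == 1:                     # single-phoneme word: label every frame
            labels[start:end] = phones[0]
            mask[start:end] = True
        elif len(phones) > 1 and end > start:    # multi-phoneme word: first and last frame only
            labels[start] = phones[0]
            labels[end - 1] = phones[-1]
            mask[start] = True
            mask[end - 1] = True
    return labels, mask

def masked_frame_ce(logits, frame_labels, mask):
    # logits: (T, num_phonemes) frame-wise phoneme logits;
    # only frames where mask is True contribute to the loss.
    if mask.sum() == 0:
        return logits.new_zeros(())
    return F.cross_entropy(logits[mask], frame_labels[mask])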

Empirical Results for Adjusting Truncated Backpropagation Through Time While Training Neural Audio Effects

This paper investigates the optimization of Truncated Backpropagation Through Time (TBPTT) for training neural networks in
digital audio effect modeling, with a focus on dynamic range compression. The study evaluates key TBPTT hyperparameters – sequence number, batch size, and sequence length – and their influence on model performance. Using a convolutional-recurrent architecture, we conduct extensive experiments across datasets with
and without conditioning by user controls. Results demonstrate
that carefully tuning these parameters enhances model accuracy
and training stability, while also reducing computational demands.
Objective evaluations confirm improved performance with optimized settings, while subjective listening tests indicate that the
revised TBPTT configuration maintains high perceptual quality.
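For readers unfamiliar with the three hyperparameters, the sketch below shows where sequence number, batch size, and sequence length typically appear in a TBPTT loop for a stateful audio model. The numeric values and the model/loss interfaces are placeholder assumptions, not the paper's settings.

import torch

BATCH_SIZE = 8     # parallel audio excerpts per update
SEQ_LEN = 2048     # samples per truncated segment (gradients stop at its boundary)
NUM_SEQS = 16      # truncated segments processed per excerpt

def tbptt_excerpt(model, optimizer, loss_fn, x, y, hidden=None):
    # x, y: (BATCH_SIZE, NUM_SEQS * SEQ_LEN, 1) input/target audio, trained segment by segment.
    for s in range(NUM_SEQS):
        seg = slice(s * SEQ_LEN, (s + 1) * SEQ_LEN)
        pred, hidden = model(x[:, seg, :], hidden)
        loss = loss_fn(pred, y[:, seg, :])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Detach so gradients do not flow past the truncation boundary,
        # while the recurrent state still carries context into the next segment.
        hidden = tuple(h.detach() for h in hidden) if isinstance(hidden, tuple) else hidden.detach()
    return hidden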

Automatic Classification of Chains of Guitar Effects Through Evolutionary Neural Architecture Search

Recent studies on classifying electric guitar effects have achieved
high accuracy, particularly with deep learning techniques. However, these studies often rely on simplified datasets consisting
mainly of single notes rather than realistic guitar recordings.
Moreover, in the specific field of effect chain estimation, the literature tends to rely on large models, making them impractical for
real-time or resource-constrained applications. In this work, we
recorded realistic guitar performances using four different guitars
and created three datasets by applying a chain of five effects with
increasing complexity: (1) fixed order and parameters, (2) fixed order with randomly sampled parameters, and (3) random order and
parameters. We also propose a novel Neural Architecture Search
method aimed at discovering accurate yet compact convolutional
neural network models to reduce power and memory consumption.
We compared its performance to a basic random search strategy,
showing that our custom Neural Architecture Search outperformed
random search in identifying models that balance accuracy and
complexity. We found that the number of convolutional and pooling layers becomes increasingly important as dataset complexity
grows, while dense layers have less impact. Additionally, among
the effects, tremolo was identified as the most challenging to classify.
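As an illustration of the search strategy, the following sketch shows a bare-bones evolutionary architecture search over a small space of convolutional/pooling depth, filter counts, and dense layers; the search space, mutation rate, and fitness interface are illustrative assumptions, not the paper's exact configuration.

import random

SPACE = {
    "n_conv": [1, 2, 3, 4, 5],       # number of conv+pool blocks
    "n_filters": [8, 16, 32, 64],
    "n_dense": [0, 1, 2],
    "dense_units": [32, 64, 128],
}

def random_genome():
    return {k: random.choice(v) for k, v in SPACE.items()}

def mutate(genome, rate=0.3):
    child = dict(genome)
    for k, v in SPACE.items():
        if random.random() < rate:
            child[k] = random.choice(v)
    return child

def evolve(fitness, pop_size=10, generations=20):
    # fitness(genome) should trade off validation accuracy against model size.
    population = [random_genome() for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(population, key=fitness, reverse=True)
        parents = scored[: pop_size // 2]        # keep the fitter half
        children = [mutate(random.choice(parents)) for _ in range(pop_size - len(parents))]
        population = parents + children
    return max(population, key=fitness)

A fitness function would typically train (or cheaply estimate) each candidate and score it by validation accuracy penalized by parameter count, reflecting the accuracy/complexity balance discussed above.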

Pitch-Conditioned Instrument Sound Synthesis From an Interactive Timbre Latent Space

This paper presents a novel approach to neural instrument sound
synthesis using a two-stage semi-supervised learning framework
capable of generating pitch-accurate, high-quality music samples
from an expressive timbre latent space. Existing approaches that
achieve sufficient quality for music production often rely on high-dimensional latent representations that are difficult to navigate and
provide unintuitive user experiences. We address this limitation
through a two-stage training paradigm: first, we train a pitch-timbre disentangled 2D representation of audio samples using a
Variational Autoencoder; second, we use this representation as
conditioning input for a Transformer-based generative model. The
learned 2D latent space serves as an intuitive interface for navigating and exploring the sound landscape. We demonstrate that the
proposed method effectively learns a disentangled timbre space,
enabling expressive and controllable audio generation with reliable
pitch conditioning. Experimental results show the model’s ability to capture subtle variations in timbre while maintaining a high
degree of pitch accuracy. The usability of our method is demonstrated in an interactive web application, highlighting its potential
as a step towards future music production environments that are
both intuitive and creatively empowering: https://pgesam.faresschulz.com/.
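The two-stage idea can be sketched in a few lines of PyTorch: a VAE-style encoder compresses a sample into a 2D timbre point, and a Transformer decoder generates audio tokens conditioned on that point together with a pitch embedding. All dimensions, module names, and the token-based output are our own assumptions for illustration; the paper's architecture details may differ.

import torch
import torch.nn as nn

class TimbreVAEEncoder(nn.Module):
    # Stage 1 (sketch): map an audio feature sequence to a 2D timbre latent.
    def __init__(self, feat_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 4))

    def forward(self, feats):                 # feats: (batch, time, feat_dim)
        stats = self.net(feats.mean(dim=1))   # pool over time, then predict mu/logvar
        mu, logvar = stats[:, :2], stats[:, 2:]
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        return z, mu, logvar                  # z is the navigable 2D timbre point

class ConditionedGenerator(nn.Module):
    # Stage 2 (sketch): a Transformer decoder conditioned on pitch and the 2D latent.
    def __init__(self, d_model=256, n_pitches=128, vocab=1024):
        super().__init__()
        self.cond = nn.Linear(2, d_model)
        self.pitch_emb = nn.Embedding(n_pitches, d_model)
        self.tok_emb = nn.Embedding(vocab, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, tokens, pitch, z):
        # Conditioning memory: one token for the timbre point, one for the pitch.
        memory = torch.stack([self.cond(z), self.pitch_emb(pitch)], dim=1)
        h = self.decoder(self.tok_emb(tokens), memory)  # causal masking omitted for brevity
        return self.head(h)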

Neural-Driven Multi-Band Processing for Automatic Equalization and Style Transfer

We present a Neural-Driven Multi-Band Processor (NDMP), a differentiable audio processing framework that augments a static six-band Parametric Equalizer (PEQ) with per-band dynamic range
compression. We optimize this processor using neural inference
for two tasks: Automatic Equalization (AutoEQ), which estimates
tonal and dynamic corrections without a reference, and Production
Style Transfer (NDMP-ST), which adapts the processing of an input signal to match the tonal and dynamic characteristics of a reference. We train NDMP using a self-supervised strategy, where the
model learns to recover a clean signal from inputs degraded with
randomly sampled NDMP parameters and gain adjustments. This
setup eliminates the need for paired input–target data and enables
end-to-end training with audio-domain loss functions. At inference, AutoEQ enhances previously unseen inputs in a blind setting, while NDMP-ST performs style transfer by predicting task-specific processing parameters. We evaluate our approach on the
MUSDB18 dataset using both objective metrics (e.g., SI-SDR,
PESQ, STFT loss) and a listening test.
Our results show that
NDMP consistently outperforms traditional PEQ and a PEQ+DRC
(single-band) baseline, offering a robust neural framework for audio enhancement that combines learned spectral and dynamic control.
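The self-supervised strategy described above can be summarized in a single training step: corrupt a clean excerpt with randomly sampled processor parameters and a gain offset, then train the estimator to predict parameters whose processing restores the original. The sketch below assumes a differentiable ndmp(audio, params) callable with a num_params attribute; the interface is illustrative only.

import torch

def self_supervised_step(clean, ndmp, estimator, loss_fn, optimizer):
    # clean: (batch, samples) excerpts; ndmp: differentiable multi-band EQ+compressor (assumed).
    batch = clean.shape[0]
    rand_params = torch.rand(batch, ndmp.num_params)        # random degradation settings
    gain_db = torch.empty(batch, 1).uniform_(-12.0, 12.0)   # random gain adjustment
    degraded = ndmp(clean, rand_params) * 10 ** (gain_db / 20)

    pred_params = estimator(degraded)                       # neural inference of corrective params
    restored = ndmp(degraded, pred_params)

    loss = loss_fn(restored, clean)                         # audio-domain loss, e.g. an STFT loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()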

Physics-Informed Deep Learning for Nonlinear Friction Model of Bow-String Interaction

This study investigates the use of an unsupervised, physics-informed deep learning framework to model a one-degree-of-freedom mass-spring system subjected to a nonlinear friction bow
force and governed by a set of ordinary differential equations.
Specifically, it examines the application of Physics-Informed Neural Networks (PINNs) and Physics-Informed Deep Operator Networks (PI-DeepONets). Our findings demonstrate that PINNs successfully address the problem across different bow force scenarios,
while PI-DeepONets perform well under low bow forces but encounter difficulties at higher forces. Additionally, we analyze the
Hessian eigenvalue density and visualize the loss landscape. Overall, the presence of large Hessian eigenvalues and sharp minima
indicates highly ill-conditioned optimization.
These results underscore the promise of physics-informed
deep learning for nonlinear modelling in musical acoustics, while
also revealing the limitations of relying solely on physics-based
approaches to capture complex nonlinearities. We demonstrate
that PI-DeepONets, with their ability to generalize across varying parameters, are well-suited for sound synthesis. Furthermore,
we demonstrate that the limitations of PI-DeepONets under higher
forces can be mitigated by integrating observation data within a
hybrid supervised-unsupervised framework. This suggests that a
hybrid supervised-unsupervised DeepONets framework could be
a promising direction for future practical applications.
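To make the setup concrete, here is a minimal PINN residual for a generic one-degree-of-freedom mass-spring oscillator driven by a bow-friction term, M u'' + K u = F_b * phi(v_b - u'). The constants and the soft friction characteristic phi are illustrative assumptions, not the paper's exact model; initial-condition losses are omitted.

import torch
import torch.nn as nn

# Illustrative constants (assumptions, not the paper's values).
M, K = 1.0, 1000.0              # mass and spring stiffness
FB, VB, A = 10.0, 0.2, 100.0    # bow force, bow velocity, friction sharpness

def friction(v_rel):
    # A common soft bow-friction characteristic, used here only as an example.
    return torch.sqrt(torch.tensor(2.0 * A)) * v_rel * torch.exp(-A * v_rel ** 2 + 0.5)

net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 64), nn.Tanh(), nn.Linear(64, 1))

def pinn_residual(t):
    # Residual of  M u'' + K u - FB * phi(VB - u')  at collocation times t.
    t = t.requires_grad_(True)
    u = net(t)
    du = torch.autograd.grad(u, t, torch.ones_like(u), create_graph=True)[0]
    ddu = torch.autograd.grad(du, t, torch.ones_like(du), create_graph=True)[0]
    return M * ddu + K * u - FB * friction(VB - du)

# Training minimizes the mean squared residual (plus initial-condition terms, omitted here).
t_col = torch.rand(256, 1)
loss = pinn_residual(t_col).pow(2).mean()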

Modeling the Impulse Response of Higher-Order Microphone Arrays Using Differentiable Feedback Delay Networks

Recently, differentiable multiple-input multiple-output Feedback
Delay Networks (FDNs) have been proposed for modeling target multichannel room impulse responses by optimizing their parameters according to perceptually-driven time-domain descriptors. However, in spatial audio applications, frequency-domain
characteristics and inter-channel differences are crucial for accurately replicating a given soundfield. In this article, targeting the modeling of the response of higher-order microphone arrays, we improve on this methodology by optimizing the FDN parameters with a novel spatially-informed loss function. We demonstrate its superior performance over previous approaches, paving the way toward the use of differentiable FDNs in spatial audio applications such as soundfield reconstruction and rendering.
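As an illustration of what a spatially informed objective can look like, the sketch below combines a per-channel spectral magnitude term with an inter-channel level-difference term computed against a reference channel. Both terms are our own stand-ins, not the paper's loss formulation.

import torch

def spatially_informed_loss(pred_ir, target_ir, n_fft=1024):
    # pred_ir, target_ir: (channels, samples) multichannel impulse responses.
    P = torch.stft(pred_ir, n_fft, return_complex=True)
    T = torch.stft(target_ir, n_fft, return_complex=True)
    spectral = (P.abs() - T.abs()).abs().mean()             # frequency-domain magnitude match

    # Inter-channel level differences relative to channel 0 (log-magnitude terms).
    eps = 1e-8
    ild_p = torch.log(P.abs() + eps) - torch.log(P[:1].abs() + eps)
    ild_t = torch.log(T.abs() + eps) - torch.log(T[:1].abs() + eps)
    spatial = (ild_p - ild_t).abs().mean()
    return spectral + spatial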

Antiderivative Antialiasing for Recurrent Neural Networks

Neural networks have become invaluable for general audio processing tasks, such as virtual analog modeling of nonlinear audio equipment.
For sequence modeling tasks in particular, recurrent neural networks (RNNs) have gained widespread adoption in recent years. Their general applicability and effectiveness stem partly from their inherent nonlinearity, which also makes them
prone to aliasing. Recent work has explored mitigating aliasing
by oversampling the network—an approach whose effectiveness is
directly linked with the incurred computational costs. This work
explores an alternative route by extending the antiderivative antialiasing technique to explicit, computable RNNs. Detailed applications to the Gated Recurrent Unit and Long Short-Term Memory cell are shown as case studies. The proposed technique is evaluated
on multiple pre-trained guitar amplifier models, assessing its impact on the amount of aliasing and model tonality. The method is
shown to reduce the models’ tendency to alias considerably across
all considered sample rates while only affecting their tonality moderately, without requiring high oversampling factors. The results
of this study can be used to improve sound quality in neural audio
processing tasks that employ a suitable class of RNNs. Additional materials are provided on the accompanying webpage.
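The core antiderivative antialiasing idea, shown below for a scalar tanh nonlinearity (first order), replaces the pointwise nonlinearity with a finite difference of its antiderivative between consecutive samples. Extending this to the gates and states of GRU/LSTM cells, as the paper does, requires additional care; this sketch only illustrates the underlying mechanism.

import numpy as np

def tanh_adaa1(x, eps=1e-6):
    # First-order antiderivative antialiasing for tanh.
    # F1(x) = log(cosh(x)) is the antiderivative of tanh.
    x = np.asarray(x, dtype=float)
    F1 = lambda v: np.logaddexp(v, -v) - np.log(2.0)    # numerically stable log(cosh(v))
    y = np.empty_like(x)
    x_prev = 0.0
    for n, xn in enumerate(x):
        diff = xn - x_prev
        if abs(diff) > eps:
            y[n] = (F1(xn) - F1(x_prev)) / diff          # averaged nonlinearity over the step
        else:
            y[n] = np.tanh(0.5 * (xn + x_prev))          # fallback for nearly repeated samples
        x_prev = xn
    return y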