Physics-Informed Deep Learning for Nonlinear Friction Model of Bow-String Interaction
This study investigates the use of an unsupervised, physics-informed deep learning framework to model a one-degree-of-freedom mass-spring system subjected to a nonlinear friction bow
force and governed by a set of ordinary differential equations.
Specifically, it examines the application of Physics-Informed Neural Networks (PINNs) and Physics-Informed Deep Operator Networks (PI-DeepONets). Our findings demonstrate that PINNs successfully address the problem across different bow force scenarios,
while PI-DeepONets perform well under low bow forces but encounter difficulties at higher forces. Additionally, we analyze the
Hessian eigenvalue density and visualize the loss landscape. Overall, the presence of large Hessian eigenvalues and sharp minima
indicates highly ill-conditioned optimization.
These results underscore the promise of physics-informed
deep learning for nonlinear modelling in musical acoustics, while
also revealing the limitations of relying solely on physics-based
approaches to capture complex nonlinearities. We demonstrate
that PI-DeepONets, with their ability to generalize across varying parameters, are well-suited for sound synthesis. Furthermore,
we demonstrate that the limitations of PI-DeepONets under higher
forces can be mitigated by integrating observation data within a
hybrid supervised-unsupervised framework. This suggests that a
hybrid supervised-unsupervised DeepONets framework could be
a promising direction for future practical applications.
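
As a rough illustration of the kind of physics residual an unsupervised physics-informed loss penalizes, the sketch below evaluates the residual of a bowed mass-spring equation on a candidate trajectory. Everything specific here is an assumption and not taken from the paper: the equation form `m*x'' + k*x = F_b * mu(v_b - x')`, the friction curve `friction_coefficient`, and the use of finite differences in place of the automatic differentiation a PINN would use.

```python
import numpy as np

def friction_coefficient(v_rel, mu_s=0.8, mu_d=0.3, v0=0.1):
    """Hypothetical smooth friction curve (an assumption, not the paper's model)."""
    # tanh gives a smooth sign change at zero relative velocity;
    # the exponential term decays from static to dynamic friction.
    return np.tanh(v_rel / 1e-3) * (mu_d + (mu_s - mu_d) * np.exp(-np.abs(v_rel) / v0))

def physics_residual_loss(t, x, m, k, bow_force, bow_velocity):
    """Mean squared residual of m*x'' + k*x - F_b*mu(v_b - x') on interior samples."""
    dt = t[1] - t[0]
    vel = (x[2:] - x[:-2]) / (2.0 * dt)              # central first difference
    acc = (x[2:] - 2.0 * x[1:-1] + x[:-2]) / dt**2   # central second difference
    forcing = bow_force * friction_coefficient(bow_velocity - vel)
    residual = m * acc + k * x[1:-1] - forcing
    return float(np.mean(residual**2))

# A free oscillation satisfies the unforced equation, so its residual is tiny;
# the same trajectory no longer satisfies the equation once a bow force is applied.
t = np.linspace(0.0, 1.0, 2001)
omega = 2.0 * np.pi
x = np.cos(omega * t)
loss_free = physics_residual_loss(t, x, m=1.0, k=omega**2, bow_force=0.0, bow_velocity=0.2)
loss_bowed = physics_residual_loss(t, x, m=1.0, k=omega**2, bow_force=1.0, bow_velocity=0.2)
```

In a PINN, `x` would be the network output and this loss would be minimized with respect to the network weights.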
Biquad Coefficients Optimization via Kolmogorov-Arnold Networks
Conventional Deep Learning (DL) approaches to Infinite Impulse
Response (IIR) filter coefficient estimation from arbitrary frequency responses are quite limited. They often suffer from inefficiencies such as tight training requirements, high complexity, and
limited accuracy. As an alternative, in this paper, we explore the
use of Kolmogorov-Arnold Networks (KANs) to predict IIR
filter coefficients, specifically biquad coefficients, effectively. By leveraging the high interpretability and accuracy of KANs, we achieve
smooth coefficient optimization. Furthermore, by constraining
the search space and exploring different loss functions, we demonstrate improved performance in speed and accuracy. Our approach
is evaluated against other existing differentiable IIR filter solutions. The results show significant advantages of KANs over existing methods, offering steadier convergence and more accurate
results. This offers new possibilities for integrating digital infinite
impulse response (IIR) filters into deep-learning frameworks.
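
To make the optimization target concrete, the following sketch evaluates a biquad's magnitude response from its coefficients and compares it to a target response with a log-magnitude loss. The specific loss and the stability-triangle check are assumptions chosen for illustration; the abstract does not state which losses or search-space constraints the authors used.

```python
import numpy as np

def biquad_magnitude_db(b, a, freqs_hz, sample_rate):
    """Magnitude (dB) of H(z) = (b0 + b1 z^-1 + b2 z^-2) / (a0 + a1 z^-1 + a2 z^-2)."""
    w = 2.0 * np.pi * np.asarray(freqs_hz, dtype=float) / sample_rate
    z_inv = np.exp(-1j * w)
    num = b[0] + b[1] * z_inv + b[2] * z_inv**2
    den = a[0] + a[1] * z_inv + a[2] * z_inv**2
    return 20.0 * np.log10(np.abs(num / den) + 1e-12)  # small offset avoids log(0)

def log_magnitude_loss(b, a, target_db, freqs_hz, sample_rate):
    """Mean squared error between predicted and target responses in dB."""
    predicted = biquad_magnitude_db(b, a, freqs_hz, sample_rate)
    return float(np.mean((predicted - np.asarray(target_db)) ** 2))

def is_stable(a):
    """Stability triangle for a second-order denominator: |a2| < 1 and |a1| < 1 + a2."""
    a1, a2 = a[1] / a[0], a[2] / a[0]
    return abs(a2) < 1.0 and abs(a1) < 1.0 + a2

freqs = np.linspace(20.0, 20000.0, 256)
passthrough_db = biquad_magnitude_db([1.0, 0.0, 0.0], [1.0, 0.0, 0.0], freqs, 48000.0)
```

Restricting predicted denominators to the stability triangle is one common way to constrain the search space so that every candidate filter is stable.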
Generative Latent Spaces for Neural Synthesis of Audio Textures
This paper investigates the synthesis of audio textures and the
structure of generative latent spaces using Variational Autoencoders (VAEs) within two paradigms of neural audio synthesis:
DSP-inspired and data-driven approaches. For each paradigm, we
propose VAE-based frameworks that allow fine-grained temporal
control. We introduce datasets across three categories of environmental sounds to support our investigations. We evaluate and compare the models’ reconstruction performance using objective metrics, and investigate their generative capabilities and latent space
structure through latent space interpolations.
Piano-SSM: Diagonal State Space Models for Efficient Midi-to-Raw Audio Synthesis
Deep State Space Models (SSMs) have shown remarkable performance in long-sequence reasoning tasks, such as raw audio
classification and audio generation. This paper introduces Piano-SSM, an end-to-end deep SSM neural network architecture designed to synthesize raw piano audio directly from MIDI input.
The network requires no intermediate representations or domain-specific expert knowledge, simplifying training and improving accessibility.
Quantitative evaluations on the MAESTRO dataset
show that Piano-SSM achieves a Multi-Scale Spectral Loss (MSSL)
of 7.02 at 16kHz, outperforming DDSP-Piano v1 with an MSSL of
7.09. At 24kHz, Piano-SSM maintains competitive performance
with an MSSL of 6.75, closely matching DDSP-Piano v2’s result of 6.58. Evaluations on the MAPS dataset achieve an MSSL
score of 8.23, which demonstrates the generalization capability
even when training with very limited data. Further analysis highlights Piano-SSM’s ability to train on high sampling-rate audio
while synthesizing audio at lower sampling rates, explicitly linking performance loss to aliasing effects. Additionally, the proposed model facilitates real-time causal inference through a custom C++17 header-only implementation. On an Intel Core i7-12700 processor at 4.5GHz, single-core inference synthesizes one second of audio at 44.1kHz in 0.44s with a workload of 23.1GFLOP/s and a 10.1µs input/output delay with the
largest network, while the smallest network at 16kHz needs only
0.04s with 2.3GFLOP/s and 2.6µs input/output delay. These results underscore Piano-SSM’s practical utility and efficiency in
real-time audio synthesis applications.
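
The causal, sample-by-sample inference described above relies on the fact that a diagonal state space layer updates each state independently. The sketch below shows a generic diagonal SSM recurrence in NumPy; it is an illustration of the general technique, not the Piano-SSM architecture or its C++ implementation, and the parameter names `lam`, `b`, and `c` are assumptions.

```python
import numpy as np

def diagonal_ssm(u, lam, b, c):
    """Causal scan of x[k] = lam * x[k-1] + b * u[k], y[k] = Re(sum(c * x[k])).

    lam, b, c are 1-D complex arrays of equal length; because the state
    matrix is diagonal, the update is an elementwise multiply-add.
    """
    x = np.zeros_like(lam, dtype=complex)
    y = np.empty(len(u))
    for k, u_k in enumerate(u):
        x = lam * x + b * u_k          # one multiply-add per state, per sample
        y[k] = np.real(np.sum(c * x))  # real-valued readout
    return y

# Impulse response of a single decaying state with pole 0.5.
impulse = np.zeros(4)
impulse[0] = 1.0
y = diagonal_ssm(impulse, lam=np.array([0.5 + 0j]), b=np.array([1.0 + 0j]), c=np.array([1.0 + 0j]))
```

Each output sample depends only on the current input and the stored state, which is what permits per-sample real-time processing. The reported 0.44s of compute per second of 44.1kHz audio corresponds to a real-time factor of 0.44.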
MorphDrive: Latent Conditioning for Cross-Circuit Effect Modeling and a Parametric Audio Dataset of Analog Overdrive Pedals
In this paper, we present an approach to the neural modeling of
overdrive guitar pedals with conditioning from a cross-circuit and
cross-setting latent space. The resulting network models the behavior of multiple overdrive pedals across different settings, offering continuous morphing between real configurations and hybrid
behaviors. Compact conditioning spaces are obtained through unsupervised training of a variational autoencoder with adversarial
training, resulting in accurate reconstruction performance across
different sets of pedals. We then compare three Hyper-Recurrent
architectures for processing, including dynamic and static HyperRNNs, and a smaller model for real-time processing. Additionally,
we present pOD-set, a new open dataset including recordings of
27 analog overdrive pedals, each with 36 gain and tone parameter combinations, totaling over 97 hours of recordings. Precise parameter setting was achieved through a custom-deployed recording
robot.
Inference-Time Structured Pruning for Real-Time Neural Network Audio Effects
Structured pruning is a technique for reducing the computational
load and memory footprint of neural networks by removing structured subsets of parameters according to a predefined schedule
or ranking criterion.
This paper investigates the application of
structured pruning to real-time neural network audio effects, focusing on both feedforward networks and recurrent architectures.
We evaluate multiple pruning strategies at inference time, without retraining, and analyze their effects on model performance. To
quantify the trade-off between parameter count and audio fidelity,
we construct a theoretical model of the approximation error as a
function of network architecture and pruning level. The resulting bounds establish a principled relationship between pruning-induced sparsity and functional error, enabling informed deployment of neural audio effects in constrained real-time environments.
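
A minimal sketch of structured pruning without retraining, assuming a two-layer feedforward network and an L2-norm ranking of hidden units. The ranking criterion and the helper names `prune_hidden_units` and `mlp` are hypothetical; the abstract does not specify which criteria the paper evaluates.

```python
import numpy as np

def prune_hidden_units(w1, b1, w2, keep_ratio):
    """Drop the hidden units whose incoming weight rows have the smallest L2 norm.

    w1: (hidden, inputs), b1: (hidden,), w2: (outputs, hidden).
    Whole rows of w1 and matching columns of w2 are removed, so the pruned
    network is genuinely smaller rather than merely sparse.
    """
    hidden = w1.shape[0]
    n_keep = max(1, int(round(keep_ratio * hidden)))
    scores = np.linalg.norm(w1, axis=1)
    keep = np.sort(np.argsort(scores)[-n_keep:])  # indices of the highest-scoring units
    return w1[keep], b1[keep], w2[:, keep]

def mlp(x, w1, b1, w2):
    """Two-layer network with a tanh hidden layer and linear output."""
    return w2 @ np.tanh(w1 @ x + b1)

rng = np.random.default_rng(0)
w1 = rng.normal(size=(4, 2))
b1 = rng.normal(size=4)
w2 = rng.normal(size=(1, 4))
w1[2], b1[2] = 0.0, 0.0  # a unit that contributes nothing, since tanh(0) = 0
w1_p, b1_p, w2_p = prune_hidden_units(w1, b1, w2, keep_ratio=0.75)
x = rng.normal(size=2)
```

Here the removed unit is dead by construction, so the output is unchanged; pruning units with nonzero contributions introduces the approximation error that the paper's bounds relate to the pruning level.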
Unsupervised Estimation of Nonlinear Audio Effects: Comparing Diffusion-Based and Adversarial Approaches
Accurately estimating nonlinear audio effects without access to
paired input-output signals remains a challenging problem. This
work studies unsupervised probabilistic approaches for solving this
task. We introduce a method, novel for this application, based
on diffusion generative models for blind system identification, enabling the estimation of unknown nonlinear effects using black- and gray-box models. This study compares this method with a
previously proposed adversarial approach, analyzing the performance of both methods under different parameterizations of the
effect operator and varying lengths of available effected recordings. Through experiments on guitar distortion effects, we show
that the diffusion-based approach provides more stable results and
is less sensitive to data availability, while the adversarial approach
is superior at estimating more pronounced distortion effects. Our
findings contribute to the robust unsupervised blind estimation of
audio effects, demonstrating the potential of diffusion models for
system identification in music technology.
Solid State Bus-Comp: A Large-Scale and Diverse Dataset for Dynamic Range Compressor Virtual Analog Modeling
Virtual Analog (VA) modeling aims to simulate the behavior
of hardware circuits via algorithms to replicate their tone digitally.
A Dynamic Range Compressor (DRC) is an audio processing module
that controls the dynamics of a track by reducing the volume of loud
sounds and amplifying that of quiet ones, which is essential in music
production. In recent years, neural-network-based VA modeling has
shown great potential in producing high-fidelity models. However,
due to the lack of data quantity and diversity, their generalization
ability in different parameter settings and input sounds is still limited. To tackle this problem, we present Solid State Bus-Comp, the
first large-scale and diverse dataset for modeling the classical VCA
compressor — SSL 500 G-Bus. Specifically, we manually collected
175 unmastered songs from the Cambridge Multitrack Library. We
recorded the compressed audio in 220 parameter combinations,
resulting in an extensive 2528-hour dataset with diverse genres, instruments, tempos, and keys. Moreover, to facilitate the use of our
proposed dataset, we conducted benchmark experiments on various
open-source black-box and grey-box models, as well as white-box
plugins. We also conducted ablation studies on different data subsets to illustrate the effectiveness of the improved data diversity and
quantity. The dataset and demos are on our project page: https://www.yichenggu.com/SolidStateBusComp/.
Aliasing Reduction in Neural Amp Modeling by Smoothing Activations
The increasing demand for high-quality digital emulations of analog audio hardware, such as vintage tube guitar amplifiers, led
to numerous works on neural network-based black-box modeling,
with deep learning architectures like WaveNet showing promising
results. However, a key limitation in all of these models was the
aliasing artifacts stemming from nonlinear activation functions in
neural networks. In this paper, we investigated novel and modified activation functions aimed at mitigating aliasing within neural
amplifier models. Supporting this, we introduced a novel metric,
the Aliasing-to-Signal Ratio (ASR), which quantitatively assesses
the level of aliasing with high accuracy. Also measuring the conventional Error-to-Signal Ratio (ESR), we conducted studies on a
range of preexisting and modern activation functions with varying
stretch factors. Our findings confirmed that activation functions
with smoother curves tend to achieve lower ASR values, indicating a noticeable reduction in aliasing. Notably, this improvement
in aliasing reduction was achievable without a substantial increase
in ESR, demonstrating the potential for high modeling accuracy
with reduced aliasing in neural amp models.
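
For reference, the Error-to-Signal Ratio is commonly computed as the energy of the modeling error divided by the energy of the target signal. The sketch below uses that common definition as an assumption, since the abstract does not spell out the exact formula; the novel ASR metric is not reproduced here because its definition is not given.

```python
import numpy as np

def error_to_signal_ratio(target, prediction, eps=1e-12):
    """ESR = sum((target - prediction)^2) / sum(target^2).

    eps guards against division by zero for an all-silent target.
    """
    target = np.asarray(target, dtype=float)
    prediction = np.asarray(prediction, dtype=float)
    return float(np.sum((target - prediction) ** 2) / (np.sum(target**2) + eps))

t = np.arange(480) / 48000.0
reference = np.sin(2.0 * np.pi * 1000.0 * t)
esr_perfect = error_to_signal_ratio(reference, reference)
esr_silent = error_to_signal_ratio(reference, np.zeros_like(reference))
```

A perfect model scores 0, and a model that outputs silence scores approximately 1, which gives a rough scale for reading reported values.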
Empirical Results for Adjusting Truncated Backpropagation Through Time While Training Neural Audio Effects
This paper investigates the optimization of Truncated Backpropagation Through Time (TBPTT) for training neural networks in
digital audio effect modeling, with a focus on dynamic range compression. The study evaluates key TBPTT hyperparameters – sequence number, batch size, and sequence length – and their influence on model performance. Using a convolutional-recurrent architecture, we conduct extensive experiments across datasets with
and without conditioning by user controls. Results demonstrate
that carefully tuning these parameters enhances model accuracy
and training stability, while also reducing computational demands.
Objective evaluations confirm improved performance with optimized settings, while subjective listening tests indicate that the
revised TBPTT configuration maintains high perceptual quality.
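
The three hyperparameters named above can be pictured as the way a long recording is cut up for training. The sketch below is one possible reading, not the paper's data pipeline: the interpretation of "sequence number" as the number of consecutive windows per stream, and the helper name `tbptt_batches`, are assumptions.

```python
import numpy as np

def tbptt_batches(audio, seq_len, batch_size, num_sequences):
    """Yield num_sequences consecutive windows of shape (batch_size, seq_len).

    The audio is split into batch_size parallel streams. In a TBPTT training
    loop the recurrent state would be carried from one window to the next
    within a stream, while gradients are cut at each window boundary.
    """
    stream_len = seq_len * num_sequences
    needed = batch_size * stream_len
    if len(audio) < needed:
        raise ValueError("audio is too short for the requested configuration")
    streams = np.asarray(audio[:needed]).reshape(batch_size, stream_len)
    for i in range(num_sequences):
        yield streams[:, i * seq_len:(i + 1) * seq_len]

# Toy example: 24 samples, 2 streams, 4 windows of 3 samples each.
windows = list(tbptt_batches(np.arange(24), seq_len=3, batch_size=2, num_sequences=4))
```

Longer windows let gradients see more temporal context at a higher memory cost, which is the kind of trade-off the study tunes.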