Wave Digital Modeling of Circuits with Multiple One-Port Nonlinearities Based on Lipschitz-Bounded Neural Networks
Neural networks have found application within the Wave Digital Filter (WDF) framework as data-driven input-output blocks for modeling single one-port or multi-port nonlinear devices in circuit systems. However, traditional neural networks lack predictable bounds on their output derivatives, which are essential to ensure convergence when simulating circuits with multiple nonlinear elements using fixed-point iterative methods, e.g., the Scattering Iterative Method (SIM). In this study, we address this issue by employing Lipschitz-bounded neural networks to regress the nonlinear WD scattering relations of one-port nonlinearities.
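As an illustration of the key constraint, the sketch below (PyTorch, not the paper's code) bounds the Lipschitz constant of a small regressor for a one-port WD scattering relation b = f(a) using spectral normalization; the paper's exact parameterization may differ.

```python
# Illustrative sketch only: one way to obtain a Lipschitz-bounded regressor for a
# one-port WD scattering relation b = f(a), using spectral normalization on every
# linear layer. The paper's exact parameterization of the bound may differ.
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import spectral_norm

class LipschitzScatteringNet(nn.Module):
    def __init__(self, hidden=16, lip_bound=0.99):
        super().__init__()
        self.lip_bound = lip_bound  # assumed overall bound; < 1 gives a contraction
        self.net = nn.Sequential(
            spectral_norm(nn.Linear(1, hidden)),     # ||W||_2 = 1 after normalization
            nn.Tanh(),                               # 1-Lipschitz activation
            spectral_norm(nn.Linear(hidden, hidden)),
            nn.Tanh(),
            spectral_norm(nn.Linear(hidden, 1)),
        )

    def forward(self, a):
        # A composition of 1-Lipschitz maps scaled by lip_bound has a Lipschitz
        # constant of at most lip_bound, which helps fixed-point schemes such as
        # SIM converge once the element is placed inside a larger circuit.
        return self.lip_bound * self.net(a)

# b = LipschitzScatteringNet()(torch.randn(8, 1))  # incident wave -> reflected wave
```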
Training Neural Models of Nonlinear Multi-Port Elements Within Wave Digital Structures Through Discrete-Time Simulation
Neural networks have been applied within the Wave Digital Filter
(WDF) framework as data-driven models for nonlinear multi-port
circuit elements. Conventionally, these models are trained on wave
variables obtained by sampling the current-voltage characteristic
of the considered nonlinear element before being incorporated into
the circuit WDF implementation. However, isolating multi-port
elements for this process can be challenging, as their nonlinear
behavior often depends on dynamic effects that emerge from interactions with the surrounding circuit. In this paper, we propose a
novel approach for training neural models of nonlinear multi-port
elements directly within a circuit’s Wave Digital (WD) discrete-time implementation, relying solely on circuit input-output voltage
measurements. Exploiting the differentiability of WD simulations,
we embed the neural network into the simulation process and optimize its parameters using gradient-based methods by minimizing
a loss function defined over the circuit output voltage. Experimental results demonstrate the effectiveness of the proposed approach
in accurately capturing the nonlinear circuit behavior, while preserving the interpretability and modularity of WDFs.
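A minimal sketch of the training loop implied above, assuming both the WD simulation and the neural element are implemented with differentiable PyTorch operations; `wd_simulate` and `neural_element` are placeholders, not interfaces from the paper.

```python
# Minimal sketch of the training idea (not the paper's code): the neural model of
# the multi-port element is embedded in a differentiable discrete-time WD
# simulation and trained only from measured input/output voltages.
import torch

def train_step(wd_simulate, neural_element, v_in, v_out_meas, optimizer):
    """wd_simulate(v_in, neural_element) -> predicted output voltage sequence.
    Both the WD update equations and the neural element are assumed to be
    implemented with differentiable torch operations; names are placeholders."""
    optimizer.zero_grad()
    v_out_pred = wd_simulate(v_in, neural_element)       # run the full WD simulation
    loss = torch.mean((v_out_pred - v_out_meas) ** 2)    # loss on the output voltage only
    loss.backward()                                      # gradients flow through the simulation
    optimizer.step()
    return loss.item()
```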
Improving Lyrics-to-Audio Alignment Using Frame-wise Phoneme Labels with Masked Cross Entropy Loss
This paper addresses the task of lyrics-to-audio alignment, which
involves synchronizing textual lyrics with corresponding music
audio. Most publicly available datasets for this task provide annotations only at the line or word level. This poses a challenge
for training lyrics-to-audio models due to the lack of frame-wise
phoneme labels. However, we find that phoneme labels can be
partially derived from word-level annotations: for single-phoneme
words, all frames corresponding to the word can be labeled with
the same phoneme; for multi-phoneme words, phoneme labels can
be assigned at the first and last frames of the word. To leverage
this partial information, we construct a mask for those frames and
propose a masked frame-wise cross-entropy (CE) loss that considers only frames with known phoneme labels. As a baseline model,
we adopt an autoencoder trained with a Connectionist Temporal
Classification (CTC) loss and a reconstruction loss. We then enhance the training process by incorporating the proposed frame-wise masked CE loss. Experimental results show that this loss improves alignment performance. Compared to other state-of-the-art models, our model
provides a comparable Mean Absolute Error (MAE) of 0.216 seconds and a top Median Absolute Error (MedAE) of 0.041 seconds
on the testing Jamendo dataset.
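The partial-labeling idea lends itself to a short sketch; the tensor shapes and function below are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the masked frame-wise cross-entropy: only frames whose phoneme label
# can be derived from the word-level annotations contribute to the loss.
import torch
import torch.nn.functional as F

def masked_frame_ce(logits, phoneme_targets, known_mask):
    """logits: (T, num_phonemes) frame-wise predictions
    phoneme_targets: (T,) phoneme index per frame (arbitrary where unknown)
    known_mask: (T,) bool, True where the label is known, i.e. every frame of a
    single-phoneme word and the first/last frames of a multi-phoneme word."""
    per_frame = F.cross_entropy(logits, phoneme_targets, reduction="none")
    masked = per_frame * known_mask.float()
    return masked.sum() / known_mask.float().sum().clamp(min=1.0)
```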
A Deep Learning Approach to the Prediction of Time-Frequency Spatial Parameters for Use in Stereo Upmixing
This paper presents a deep learning approach to parametric time-frequency parameter prediction for use within stereo upmixing algorithms. The approach uses a Multi-Channel U-Net with Residual connections (MuCh-Res-U-Net), trained on a novel dataset of stereo and parametric time-frequency spatial audio data, to predict time-frequency spatial parameters from a stereo input signal for positions on a 50-point Lebedev quadrature sampled sphere. An example upmix pipeline is then proposed which utilises the predicted time-frequency spatial parameters to both extract and remap stereo signal components to target spherical harmonic components, facilitating the generation of a full spherical representation of the upmixed sound field.
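A shape-level sketch of the prediction task follows; the batch size, STFT dimensions, and number of parameters per direction are assumptions, and the 1x1 convolution merely stands in for the MuCh-Res-U-Net.

```python
# Shape-level sketch only: dimensions are assumed, and the 1x1 convolution is a
# placeholder for the MuCh-Res-U-Net.
import torch

B, F_bins, T_frames = 1, 513, 128          # assumed STFT layout
n_directions, n_params = 50, 2             # 50-point Lebedev grid, parameters per direction

stereo_tf = torch.randn(B, 2, F_bins, T_frames)                 # stereo time-frequency input
much_res_unet = torch.nn.Conv2d(2, n_directions * n_params, 1)  # placeholder network
params = much_res_unet(stereo_tf).view(B, n_directions, n_params, F_bins, T_frames)
```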
Audio Processor Parameters: Estimating Distributions Instead of Deterministic Values
Audio effects and sound synthesizers are widely used processors
in popular music.
Their parameters control the quality of the
output sound. Multiple combinations of parameters can lead to
the same sound.
While recent approaches have been proposed
to estimate these parameters given only the output sound, those
are deterministic, i.e. they only estimate a single solution among
the many possible parameter configurations.
In this work, we
propose to model the parameters as probability distributions instead
of deterministic values. To learn the distributions, we optimize
two objectives: (1) we minimize the reconstruction error between
the ground truth output sound and the one generated using the
estimated parameters, as is usually done, but also (2) we maximize
the parameter diversity, using entropy. We evaluate our approach
through two numerical audio experiments to show its effectiveness.
These results show how our approach effectively outputs multiple
combinations of parameters to match one sound.
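A conceptual sketch of the two-objective criterion, assuming a differentiable renderer and a reparameterizable parameter distribution; the entropy weight is a hypothetical hyperparameter, not a value from the paper.

```python
# Conceptual sketch of the two-objective training criterion (not the paper's code).
import torch

def distribution_loss(render, target_audio, param_dist, entropy_weight=0.1):
    """render(params) -> audio synthesized with the sampled parameters;
    param_dist: a torch.distributions object over processor parameters."""
    params = param_dist.rsample()                        # differentiable sample
    audio = render(params)
    recon = torch.mean((audio - target_audio) ** 2)      # (1) reconstruction error
    entropy = param_dist.entropy().sum()                 # (2) parameter diversity
    return recon - entropy_weight * entropy              # maximizing entropy => subtract it
```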
Vocal Timbre Effects with Differentiable Digital Signal Processing
We explore two approaches to creatively altering vocal timbre using Differentiable Digital Signal Processing (DDSP). The first approach is inspired by classic cross-synthesis techniques. A pretrained DDSP decoder predicts a filter for a noise source and a harmonic distribution, based on pitch and loudness information extracted from the vocal input. Before synthesis, the harmonic distribution is modified by interpolating between the predicted distribution and the harmonics of the input. We provide a real-time implementation of this approach in the form of a Neutone model. In the second approach, autoencoder models are trained on datasets consisting of both vocal and instrument training data. To apply the effect, the trained autoencoder attempts to reconstruct the vocal input. We find that there is a desirable “sweet spot” during training, where the model has learned to reconstruct the phonetic content of the input vocals, but is still affected by the timbre of the instrument mixed into the training data. After further training, that effect disappears. A perceptual evaluation compares the two approaches. We find that the autoencoder in the second approach is able to reconstruct intelligible lyrical content without any explicit phonetic information provided during training.
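The harmonic-modification step of the first approach can be sketched as a simple interpolation; the variable names and the renormalization choice are assumptions.

```python
# Rough sketch of the harmonic-modification step: interpolate between the DDSP
# decoder's predicted harmonic distribution and the harmonics measured from the
# vocal input.
import numpy as np

def mix_harmonics(predicted_dist, input_harmonics, alpha=0.5):
    """Both arguments are same-length arrays of per-harmonic amplitudes;
    alpha=0 keeps the prediction, alpha=1 keeps the input harmonics."""
    mixed = (1.0 - alpha) * predicted_dist + alpha * input_harmonics
    total = np.sum(mixed)
    return mixed / total if total > 0 else mixed   # keep it a distribution
```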
Sound Matching Using Synthesizer Ensembles
Sound matching allows users to automatically approximate existing sounds using a synthesizer. Previous work has mostly focused on algorithms for automatically programming an existing synthesizer. This paper proposes a system for selecting between different synthesizer designs, each one with a corresponding automatic programmer. An implementation that allows designing ensembles based on a template is demonstrated. Several experiments are presented using a simple subtractive synthesis design. Using an ensemble of synthesizer-programmer pairs is shown to provide better matching than a single programmer trained for an equivalent integrated synthesizer. Scaling to hundreds of synthesizers is shown to improve match quality.
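A hedged sketch of ensemble-based matching, assuming each programmer and synthesizer exposes a callable interface and that selection uses a generic audio distance; none of these interfaces are taken from the paper.

```python
# Illustrative sketch: every (programmer, synthesizer) pair proposes a patch, and
# the pair whose rendering is closest to the target is kept.
def match_with_ensemble(target_audio, ensemble, distance):
    """ensemble: list of (programmer, synth) pairs, where
    programmer(target_audio) -> parameters and synth(parameters) -> audio."""
    best = None
    for programmer, synth in ensemble:
        params = programmer(target_audio)
        candidate = synth(params)
        d = distance(candidate, target_audio)
        if best is None or d < best[0]:
            best = (d, synth, params, candidate)
    return best  # (distance, chosen synth, parameters, rendered audio)
```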
Pitch-Conditioned Instrument Sound Synthesis From an Interactive Timbre Latent Space
This paper presents a novel approach to neural instrument sound
synthesis using a two-stage semi-supervised learning framework
capable of generating pitch-accurate, high-quality music samples
from an expressive timbre latent space. Existing approaches that
achieve sufficient quality for music production often rely on high-dimensional latent representations that are difficult to navigate and
provide unintuitive user experiences. We address this limitation
through a two-stage training paradigm: first, we train a pitch-timbre disentangled 2D representation of audio samples using a
Variational Autoencoder; second, we use this representation as
conditioning input for a Transformer-based generative model. The
learned 2D latent space serves as an intuitive interface for navigating and exploring the sound landscape. We demonstrate that the
proposed method effectively learns a disentangled timbre space,
enabling expressive and controllable audio generation with reliable
pitch conditioning. Experimental results show the model’s ability to capture subtle variations in timbre while maintaining a high
degree of pitch accuracy. The usability of our method is demonstrated in an interactive web application, highlighting its potential
as a step towards future music production environments that are
both intuitive and creatively empowering:
https://pgesam.faresschulz.com/.
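A minimal sketch of the two-stage inference path; the encoder returning a 2D timbre coordinate and the generator accepting a concatenated pitch-plus-timbre conditioning vector are both assumptions about the interfaces.

```python
# Minimal sketch of the two-stage inference path (interfaces are assumptions).
import torch

def generate(vae_encoder, generator, reference_audio, pitch):
    """reference_audio: (1, samples); pitch: scalar note value (assumed encoding)."""
    with torch.no_grad():
        timbre_2d = vae_encoder(reference_audio)        # (1, 2) latent coordinates
    pitch_cond = torch.tensor([[float(pitch)]])         # (1, 1)
    cond = torch.cat([timbre_2d, pitch_cond], dim=-1)   # (1, 3) conditioning vector
    return generator(cond)                              # pitch-conditioned sample
```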
A Non-Uniform Subband Implementation of an Active Noise Control System for Snoring Reduction
Snoring noise can be extremely annoying and can negatively
affect people’s social lives. To reduce this problem, active noise
control (ANC) systems can be adopted for snoring cancellation.
Recently, adaptive subband systems have been developed to improve the convergence rate and reduce the computational complexity of the ANC algorithm. Several structures have been proposed
with different approaches. This paper proposes a non-uniform subband adaptive filtering (SAF) structure to improve a feedforward
active noise control algorithm. The non-uniform band distribution
allows for a higher frequency resolution of the lower frequencies,
where the snoring noise is most concentrated. Several experiments
have been carried out to evaluate the proposed system in comparison with a reference ANC system which uses a uniform approach.
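One possible non-uniform band layout with finer resolution at low frequencies is octave-like spacing, sketched below; this is an assumption for illustration, not necessarily the paper's band design.

```python
# Sketch of one possible non-uniform band layout: narrow bands at low frequencies,
# where snoring energy is concentrated, and wider bands above.
import numpy as np

def octave_band_edges(f_low=31.25, f_high=8000.0):
    """Return band edges in Hz whose widths double from band to band."""
    edges = [0.0, f_low]
    while edges[-1] < f_high:
        edges.append(min(edges[-1] * 2.0, f_high))
    return np.array(edges)

# octave_band_edges() -> [0, 31.25, 62.5, 125, 250, 500, 1000, 2000, 4000, 8000] Hz
```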
Spatializing Screen Readers: Extending VoiceOver via Head-Tracked Binaural Synthesis for User Interface Accessibility
Traditional screen-based graphical user interfaces (GUIs) pose significant accessibility challenges for visually impaired users. This
paper demonstrates how existing GUI elements can be translated
into an interactive auditory domain using high-order Ambisonics and inertial sensor-based head tracking, culminating in a real-time binaural rendering over headphones. The proposed system
is designed to spatialize the auditory output from VoiceOver, the
built-in macOS screen reader, aiming to foster clearer mental mapping and enhanced navigability.
A between-groups experiment
was conducted to compare standard VoiceOver with the proposed
spatialized version. Non-visually-impaired participants (n = 32),
with no visual access to the test interface, completed a list-based
exploration and then attempted to reconstruct the UI solely from
auditory cues. Experimental results indicate that the head-tracked
group achieved a slightly higher accuracy in reconstructing the interface, while user experience assessments showed no significant
differences in self-reported workload or usability. These findings
suggest that potential benefits may come from the integration of
head-tracked binaural audio into mainstream screen-reader workflows, but future investigations involving blind and low-vision users
are needed.
Although the experimental testbed uses a generic
desktop app, our ultimate goal is to tackle the complex visual layouts of music-production software, where a head-tracked audio
approach could benefit visually impaired producers and musicians
navigating plug-in controls.
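As a toy illustration of the head-tracking step, the sketch below counter-rotates a first-order ambisonic frame by the head yaw before binaural decoding; the paper uses high-order Ambisonics, and the channel ordering and sign conventions here are assumptions.

```python
# Toy illustration: counter-rotate a first-order ambisonic frame by the head yaw
# before binaural decoding, so sources stay fixed in the room as the listener
# turns. Channel ordering (ACN: W, Y, Z, X) and sign conventions are assumptions.
import numpy as np

def yaw_rotate_foa(w, y, z, x, yaw_rad):
    """Rotate a first-order ambisonic frame about the vertical axis;
    W and Z are unaffected by a pure yaw rotation."""
    c, s = np.cos(yaw_rad), np.sin(yaw_rad)
    x_rot = c * x - s * y
    y_rot = s * x + c * y
    return w, y_rot, z, x_rot

# compensate the yaw measured by the inertial sensor:
# w, y, z, x = yaw_rotate_foa(w, y, z, x, -head_yaw)
```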