Modal Spring Reverb Based on Discretisation of the Thin Helical Spring Model

The distributed nature of coupling in helical springs presents specific challenges in obtaining efficient computational structures for accurate spring reverb simulation. For direct simulation approaches, such as finite-difference methods, this typically manifests as significant numerical dispersion within the hearing range. Building on a recent study of a simpler spring model, this paper presents an alternative discretisation approach that employs higher-order spatial approximations and applies centred stencils at the boundaries to address the underlying linear-system eigenvalue problem. Temporal discretisation is then applied to the resultant uncoupled mode system, yielding an efficient and flexible modal reverb structure. Dispersion analysis shows that numerical dispersion errors can be kept extremely small across the hearing range for a relatively low number of system nodes. Analysis of an impulse response simulated using model parameters calculated from a measured spring geometry confirms that the model captures an enhanced set of spring characteristics.
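An uncoupled mode system of this kind lends itself to a simple resonator-bank implementation. The sketch below illustrates the general modal-reverb idea, not the paper's specific discretisation: it assumes per-mode frequencies, T60 decay times, and gains (all hypothetical arguments here) have already been obtained from the eigenvalue analysis.

```python
import numpy as np

def modal_reverb_ir(freqs_hz, t60s, gains, fs=48000, dur=2.0):
    """Sum of exponentially decaying sinusoids, one per spring mode.

    freqs_hz, t60s, gains are hypothetical per-mode parameters that
    would come from solving the spring model's eigenvalue problem.
    """
    t = np.arange(int(dur * fs)) / fs
    ir = np.zeros_like(t)
    for f, t60, g in zip(freqs_hz, t60s, gains):
        sigma = np.log(1000.0) / t60  # decay rate giving -60 dB at t = t60
        ir += g * np.exp(-sigma * t) * np.sin(2 * np.pi * f * t)
    return ir
```

Because each mode is independent, the same structure can also run as a bank of recursive two-pole resonators for real-time use.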
Transition-Aware: A More Robust Approach for Piano Transcription

Piano transcription is a classic problem in music information retrieval, and numerous deep-learning-based transcription methods have been proposed in recent years. In 2019, Google Brain published a large-scale piano transcription dataset, MAESTRO. On this dataset, the Onsets and Frames transcription approach proposed by Hawthorne et al. achieved a stunning onset F1 score of 94.73%. Unlike the annotation method of Onsets and Frames, the Transition-Aware model presented in this paper annotates the attack process of piano signals, called the attack transition, across multiple frames instead of marking only the onset frame. In this way, the piano signals around the onset time are taken into account, making onset detection more stable and robust. Transition-Aware achieves a higher transcription F1 score than Onsets and Frames on the MAESTRO and MAPS datasets, substantially reducing spurious note detections. This indicates that the Transition-Aware approach generalizes better across datasets.
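As a rough illustration of the labelling idea (the exact window width and label values used by Transition-Aware are not specified here and are assumptions), the sketch below spreads each onset annotation over several neighbouring frames instead of a single frame:

```python
import numpy as np

def attack_transition_labels(onset_times, n_frames, hop_s, width=3):
    """Mark a span of frames around each onset (hypothetical scheme).

    Unlike single-frame onset targets, all frames within `width`
    frames of the onset are labelled 1, exposing the whole attack
    process to the network.
    """
    labels = np.zeros(n_frames, dtype=np.float32)
    for t in onset_times:
        center = int(round(t / hop_s))
        lo, hi = max(0, center - width), min(n_frames, center + width + 1)
        labels[lo:hi] = 1.0
    return labels
```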
Interpretable timbre synthesis using variational autoencoders regularized on timbre descriptors

Controllable timbre synthesis has been a subject of research for several decades, and deep neural networks have been the most successful in this area. Deep generative models such as Variational Autoencoders (VAEs) have the ability to generate a high-level representation of audio while providing a structured latent space. Despite their advantages, the interpretability of these latent spaces in terms of human perception is often limited. To address this limitation and enhance the control over timbre generation, we propose a regularized VAE-based latent space that incorporates timbre descriptors. Moreover, we suggest a more concise representation of sound by utilizing its harmonic content, in order to minimize the dimensionality of the latent space.
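One common way to realize such a regularization, shown below as a sketch rather than the paper's exact objective, is to add a term that ties the first latent dimensions to normalized timbre descriptors (e.g., spectral centroid); the weights beta and gamma are illustrative:

```python
import torch
import torch.nn.functional as F

def regularized_vae_loss(x, x_hat, mu, logvar, descriptors,
                         beta=1.0, gamma=10.0):
    """VAE loss with a timbre-descriptor regularizer (illustrative).

    `descriptors` holds normalized timbre descriptors aligned with the
    first latent dimensions, so each of those dimensions is pushed to
    encode one descriptor.
    """
    recon = F.mse_loss(x_hat, x)
    # KL divergence to the unit Gaussian prior, averaged over elements
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    d = descriptors.shape[1]
    reg = F.mse_loss(mu[:, :d], descriptors)  # tie latent dims to descriptors
    return recon + beta * kl + gamma * reg
```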
On the Estimation of Sinusoidal Parameters via Parabolic Interpolation of Scaled Magnitude Spectra

Sinusoids are widely used to represent the oscillatory modes of music and speech. The estimation of the sinusoidal parameters directly affects the quality of the representation. A parabolic interpolation of the peaks of the log-magnitude spectrum is commonly used to obtain more accurate estimates of the frequencies and amplitudes of the sinusoids at a relatively low computational cost. Recently, Werner and Germain proposed an improved sinusoidal estimator that performs parabolic interpolation of the peaks of a power-scaled magnitude spectrum. For each analysis window type and size, a power-scaling factor p is pre-calculated via a computationally demanding heuristic. Consequently, the power-scaling estimation method is currently constrained to a few tabulated power-scaling factors for pre-selected window sizes, limiting its practical applications. In this article, we propose a method to obtain the power-scaling factor p for any window size from the tabulated values. Additionally, we investigate the impact of zero-padding on the estimation accuracy of the power-scaled sinusoidal parameter estimator.
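For reference, the core estimator is compact. The sketch below applies parabolic interpolation to a power-scaled magnitude spectrum; the formulas are the standard three-point parabola fit, while the value of p is only a placeholder, not one of the tabulated window-dependent factors:

```python
import numpy as np

def parabolic_peak(spec_mag, k, p=0.23):
    """Parabolic interpolation of a peak on a power-scaled spectrum.

    spec_mag: magnitude spectrum |X|; k: index of a local maximum;
    p: power-scaling factor (window-dependent; the default here is a
    placeholder for illustration only).
    """
    a, b, c = spec_mag[k - 1] ** p, spec_mag[k] ** p, spec_mag[k + 1] ** p
    delta = 0.5 * (a - c) / (a - 2 * b + c)   # fractional bin offset
    peak_bin = k + delta                      # interpolated frequency (bins)
    amp_scaled = b - 0.25 * (a - c) * delta   # parabola value at the vertex
    return peak_bin, amp_scaled ** (1.0 / p)  # undo the power scaling
```

Setting p to a log mapping recovers the classic log-magnitude interpolator; the power scaling trades that familiar form for lower bias.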
A Hierarchical Deep Learning Approach for Minority Instrument Detection

Identifying instrument activities within audio excerpts is vital in music information retrieval, with significant implications for music cataloging and discovery. Prior deep learning endeavors in musical instrument recognition have predominantly emphasized instrument classes with ample data availability. Recent studies have demonstrated the applicability of hierarchical classification in detecting instrument activities in orchestral music, even with limited fine-grained annotations at the instrument level. Based on the Hornbostel-Sachs classification, such a hierarchical classification system is evaluated using the MedleyDB dataset, renowned for its diversity and richness concerning various instruments and music genres. This work presents various strategies to integrate hierarchical structures into models and tests a new class of models for hierarchical music prediction. This study showcases more reliable coarse-level instrument detection by bridging the gap between detailed instrument identification and group-level recognition, paving the way for further advancements in this domain.
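One possible way to couple the two levels, sketched below under the assumption of multi-label activity outputs and a known fine-to-family mapping matrix (both illustrative, not the paper's exact model), is a noisy-OR aggregation of fine-grained probabilities into their Hornbostel-Sachs family:

```python
import torch
import torch.nn as nn

class HierarchicalHead(nn.Module):
    """Fine (instrument) head on a shared embedding, with coarse
    (family) activity derived from it. `fine_to_coarse` is a binary
    matrix: M[i, j] = 1 if fine class j belongs to family i.
    """
    def __init__(self, dim, n_fine, fine_to_coarse):
        super().__init__()
        self.fine = nn.Linear(dim, n_fine)
        self.register_buffer("M", fine_to_coarse)

    def forward(self, h):
        fine_prob = torch.sigmoid(self.fine(h))  # multi-label activity
        # noisy-OR: a family is active if any of its members is active
        coarse_prob = 1 - torch.prod(
            1 - fine_prob.unsqueeze(1) * self.M, dim=-1
        )
        return fine_prob, coarse_prob
```

Deriving the coarse prediction from the fine one keeps the two levels consistent by construction, which matters when fine-grained labels are scarce.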
Vocal Tract Area Estimation by Gradient Descent

Articulatory features can provide interpretable and flexible controls for the synthesis of human vocalizations by allowing the user to directly modify parameters like vocal strain or lip position. To make this manipulation through resynthesis possible, we need to estimate the features that result in a desired vocalization directly from audio recordings. In this work, we propose a white-box optimization technique for estimating glottal source parameters and vocal tract shapes from audio recordings of human vowels. The approach is based on inverse filtering and optimizing the frequency response of a waveguide model of the vocal tract with gradient descent, propagating error gradients through the mapping of articulatory features to the vocal tract area function. We apply this method to the task of matching the sound of the Pink Trombone, an interactive articulatory synthesizer, to a given vocalization. We find that our method accurately recovers control functions for audio generated by the Pink Trombone itself. We then compare our technique against evolutionary optimization algorithms and a neural network trained to predict control parameters from audio. A subjective evaluation finds that our approach outperforms these black-box optimization baselines on the task of reproducing human vocalizations.
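To make the optimization loop concrete, the sketch below substitutes a classic lossless-tube (Kelly-Lochbaum) model for the paper's waveguide: tube areas map to reflection coefficients, the step-up recursion yields an all-pole polynomial, and gradient descent matches its log-magnitude response to a target. All names, hyperparameters, and the 8-section resolution are illustrative assumptions:

```python
import math
import torch

def areas_to_lpc(areas):
    """Reflection coefficients from tube section areas, then the
    step-up recursion to an all-pole polynomial (a sketch of the
    classic lossless-tube model, not the paper's exact waveguide)."""
    k = (areas[:-1] - areas[1:]) / (areas[:-1] + areas[1:])
    a = torch.ones(1, dtype=areas.dtype)
    for ki in k:
        padded = torch.cat([a, torch.zeros(1, dtype=a.dtype)])
        a = padded + ki * padded.flip(0)  # a_j <- a_j + k * a_{i-j}
    return a

def fit_areas(target_logmag, n_sections=8, steps=500, lr=0.05):
    raw = torch.zeros(n_sections + 1, requires_grad=True)  # unconstrained
    opt = torch.optim.Adam([raw], lr=lr)
    w = torch.linspace(0, math.pi, target_logmag.numel())
    for _ in range(steps):
        areas = torch.exp(raw)                 # keep areas positive
        a = areas_to_lpc(areas)
        n = torch.arange(a.numel(), dtype=w.dtype)
        A = torch.exp(-1j * w[:, None] * n) @ a.to(torch.complex64)
        logmag = -torch.log(A.abs() + 1e-8)    # all-pole response 1/|A|
        loss = torch.mean((logmag - target_logmag) ** 2)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.exp(raw).detach()
```

The key point the sketch shares with the paper is that the error gradient propagates through the area-to-filter mapping, so the optimizer works directly on the articulatory representation.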
ICGAN: An Implicit Conditioning Method for Interpretable Feature Control of Neural Audio Synthesis

Neural audio synthesis methods can achieve high-fidelity and realistic sound generation by utilizing deep generative models. Such models typically rely on external labels, which are often discrete, as conditioning information to achieve guided sound generation. However, it remains difficult to control subtle changes in sounds without appropriate and descriptive labels, especially given a limited dataset. This paper proposes an implicit conditioning method for neural audio synthesis using generative adversarial networks that allows for interpretable control of the acoustic features of synthesized sounds. Our technique creates a continuous conditioning space that enables timbre manipulation without relying on explicit labels. We further introduce an evaluation metric to explore controllability and demonstrate that our approach is effective in enabling a degree of controlled variation across different synthesized sound effects, for both in-domain and cross-domain sounds.
Improving Synthesizer Programming From Variational Autoencoders Latent Space

Deep neural networks have recently been applied to the task of automatic synthesizer programming, i.e., finding optimal values of sound synthesis parameters in order to reproduce a given input sound. This paper focuses on generative models, which can infer parameters as well as generate new sets of parameters or perform smooth morphing effects between sounds. We introduce new models to ensure scalability and to increase performance by using heterogeneous representations of parameters as numerical and categorical random variables. Moreover, a spectral variational autoencoder architecture with multi-channel input is proposed in order to improve inference of parameters related to the pitch and intensity of input sounds. Model performance was evaluated according to several criteria, such as parameter estimation error and audio reconstruction accuracy. Training and evaluation were performed using a dataset of 30k presets, which is published with this paper. Results demonstrate significant improvements in parameter inference and audio accuracy, and show that the presented models can be used with subsets or full sets of synthesizer parameters.
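A minimal sketch of the heterogeneous-parameter idea (not the paper's exact objective): numerical parameters are treated as continuous regression targets, while each categorical parameter, such as an oscillator waveform selector, gets its own cross-entropy term:

```python
import torch
import torch.nn.functional as F

def preset_loss(pred_num, true_num, pred_cat_logits, true_cat):
    """Hybrid loss for synthesizer presets (illustrative).

    pred_num / true_num: tensors of continuous parameter values;
    pred_cat_logits / true_cat: per-parameter logits and class indices
    for the categorical parameters.
    """
    loss = F.mse_loss(pred_num, true_num)
    for logits, target in zip(pred_cat_logits, true_cat):
        loss = loss + F.cross_entropy(logits, target)
    return loss
```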
Practical Virtual Analog Modeling Using Möbius Transforms

Möbius transforms define a family of one-step discretization methods that offers a framework for alleviating well-known limitations of common one-step methods, such as the trapezoidal method, at no cost in model compactness or complexity. In this paper, we extend the existing theory around these methods and show how it can be applied to common frameworks used to structure virtual analog models. We then propose practical strategies to tune the transform parameters for best simulation results. Finally, we show how such strategies enable us to formulate much-improved non-oversampled virtual analog models for several historical audio circuits.
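To illustrate the family concretely, the sketch below discretizes the analog one-pole H(s) = 1/(1 + s/w0) under a generic Möbius substitution s = (a*z + b)/(c*z + d). The default parameters recover the trapezoidal (bilinear) rule; the tuning strategies the paper proposes would select other values:

```python
import numpy as np

def mobius_one_pole(w0, fs, a=None, b=None, c=1.0, d=1.0):
    """Discretize H(s) = 1 / (1 + s/w0) via s = (a*z + b) / (c*z + d).

    Defaults reproduce the trapezoidal rule, s = 2*fs*(z - 1)/(z + 1);
    other (a, b, c, d) values give the wider one-step family.
    """
    if a is None:
        a = 2.0 * fs
    if b is None:
        b = -2.0 * fs
    # Substituting yields H(z) = w0*(c*z + d) / ((w0*c + a)*z + (w0*d + b))
    b_dig = np.array([w0 * c, w0 * d])
    a_dig = np.array([w0 * c + a, w0 * d + b])
    return b_dig / a_dig[0], a_dig / a_dig[0]  # normalized (b, a) coefficients
```

With (a, b, c, d) = (2*fs, -2*fs, 1, 1) this is the standard bilinear one-pole; varying the four parameters shifts the method's frequency warping and damping without changing its first-order structure, which is the "no cost in compactness" property the abstract refers to.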
A Study of Control Methods for Percussive Sound Synthesis Based on GANs

The process of creating drum sounds has seen significant evolution in the past decades. The development of analogue drum synthesizers, such as the TR-808, and of modern sound design tools in Digital Audio Workstations led to a variety of drum timbres that defined entire musical genres. Recently, drum synthesis research has been revived with a new focus on training generative neural networks to create drum sounds. Different interfaces have previously been proposed to control the generative process, from low-level latent space navigation to high-level semantic feature parameterisation, but no comprehensive analysis has been presented to evaluate how each approach relates to the creative process. We aim to evaluate how different interfaces support creative control over drum generation by conducting a user study based on the Creative Support Index. We experiment with both a supervised method that decodes semantic latent space directions and an unsupervised Closed-Form Factorization approach from the computer vision literature to parameterise the generation process, and demonstrate that the latter is the preferred means of controlling a drum synthesizer based on the StyleGAN2 network architecture.
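As a sketch of the unsupervised approach (assuming W is a style-modulation weight matrix taken from the trained StyleGAN2 generator; the variable names are illustrative), Closed-Form Factorization reduces to a singular value decomposition:

```python
import numpy as np

def closed_form_directions(W, n_dirs=5):
    """SeFa-style Closed-Form Factorization: interpretable latent
    directions are the top right-singular vectors of a generator
    weight matrix W of shape (out_features, latent_dim).
    """
    _, _, Vt = np.linalg.svd(W, full_matrices=False)
    return Vt[:n_dirs]  # each row is a unit-norm latent direction

# Editing a latent code: z_edit = z + alpha * direction,
# where alpha is a scalar strength chosen by the user.
```

Because the directions come straight from the trained weights, no labelled data or auxiliary classifier is needed, which is what makes the method attractive as a control interface.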