Differentiable Time–frequency Scattering on GPU
Joint time–frequency scattering (JTFS) is a convolutional operator in the time–frequency domain which extracts spectrotemporal modulations at various rates and scales. It offers an idealized model of spectrotemporal receptive fields (STRF) in the primary auditory cortex, and may thus serve as a biologically plausible surrogate for human perceptual judgments at the scale of isolated audio events. Yet, prior implementations of JTFS and STRF have remained outside the standard toolkit of perceptual similarity measures and evaluation methods for audio generation. We trace this issue to three limitations: differentiability, speed, and flexibility. In this paper, we present an implementation of time–frequency scattering in Python. Unlike prior implementations, ours accommodates NumPy, PyTorch, and TensorFlow as backends and is thus portable to both CPU and GPU. We demonstrate the usefulness of JTFS via three applications: unsupervised manifold learning of spectrotemporal modulations, supervised classification of musical instruments, and texture resynthesis of bioacoustic sounds.
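For readers who want to see the differentiability claim in action, here is a minimal sketch that runs gradient descent on an input signal through Kymatio's PyTorch scattering frontend (Scattering1D; the joint time–frequency frontend exposes an analogous interface). The signal length, hyperparameters, and random target are illustrative only.

```python
# Gradient-based signal matching through a differentiable scattering transform,
# assuming Kymatio's PyTorch backend (pip install kymatio).
import torch
from kymatio.torch import Scattering1D

T = 2 ** 13                              # signal length in samples
scattering = Scattering1D(J=8, shape=(T,), Q=8)

target = torch.randn(1, T)               # stand-in for a reference sound
x = torch.zeros(1, T, requires_grad=True)
optimizer = torch.optim.Adam([x], lr=0.1)

S_target = scattering(target).detach()
for step in range(100):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(scattering(x), S_target)
    loss.backward()                      # gradients flow through the scattering op
    optimizer.step()
```

On a CUDA machine, moving `scattering`, `x`, and `target` to the GPU is all that is needed to exploit the paper's speed claim.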
Subjective Evaluation of Sound Quality and Control of Drum Synthesis with StyleWaveGAN
In this paper, we investigate the perceptual properties of StyleWaveGAN, a drum synthesis method proposed in a previous publication. For both sound quality and control precision, StyleWaveGAN has been shown to deliver state-of-the-art performance on quantitative metrics (FAD and MSE of the control parameters). The present paper aims to provide insight into the perceptual relevance of these results. Accordingly, we performed a subjective evaluation of the sound quality as well as a subjective evaluation of the precision of the control, using timbre descriptors from the AudioCommons toolbox. We evaluate the sound quality with a mean opinion score and measure the psychophysical response to variations of the control. By means of these perceptual tests, we demonstrate that StyleWaveGAN produces better sound quality than the state-of-the-art model DrumGAN and that the mean control error is lower than the absolute threshold of perception at every point of measurement used in the experiment.
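As an illustration of the quality evaluation, the following sketch computes a mean opinion score and its 95% confidence interval from hypothetical listener ratings; it is not the authors' analysis code.

```python
# Mean opinion score with a 95% confidence interval for one stimulus,
# using made-up ratings on the usual 1-5 scale.
import numpy as np
from scipy import stats

ratings = np.array([4, 5, 3, 4, 4, 5, 3, 4])   # hypothetical listener scores
mos = ratings.mean()
ci = stats.t.interval(0.95, len(ratings) - 1,
                      loc=mos, scale=stats.sem(ratings))
print(f"MOS = {mos:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```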
Neural Music Instrument Cloning From Few Samples
Neural music instrument cloning is an application of deep neural networks in which a trained network imitates the timbre of a particular music instrument recording. One can create such clones using an approach such as DDSP [1], which has been shown to achieve good synthesis quality for several instrument types [2]. However, this approach needs about ten minutes of audio data from the instrument of interest (target recording audio). In this work, we modify the DDSP architecture and apply transfer learning techniques used in speech voice cloning [3] to significantly reduce the amount of target recording audio required. We compare various cloning approaches and architectures across durations of target recording audio ranging from four to 256 seconds. We demonstrate editing of loudness and pitch, as well as timbre transfer, from only 16 seconds of target recording audio. Our code is available online, along with many audio examples.
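The sketch below illustrates the kind of conditioning features a DDSP-style cloning pipeline consumes, extracted here with librosa; the file name and frame settings are illustrative, not taken from the paper.

```python
# Frame-wise f0 and loudness curves from a short target recording,
# the two control signals a DDSP-style synthesizer is conditioned on.
import librosa
import numpy as np

y, sr = librosa.load("target_recording.wav", sr=16000, duration=16.0)

# Fundamental frequency via the YIN estimator
f0 = librosa.yin(y, fmin=librosa.note_to_hz("C2"),
                 fmax=librosa.note_to_hz("C7"),
                 sr=sr, frame_length=2048, hop_length=256)

# A-weighted log-power loudness per frame
S = np.abs(librosa.stft(y, n_fft=2048, hop_length=256))
freqs = librosa.fft_frequencies(sr=sr, n_fft=2048)
a_weight = librosa.A_weighting(freqs)[:, None]
loudness = np.mean(librosa.power_to_db(S ** 2) + a_weight, axis=0)

# Cloning then fine-tunes only timbre-specific decoder weights on these
# few seconds of (f0, loudness)-conditioned audio.
```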
Real-Time Implementation of the Dynamic Stiff String Using Finite-Difference Time-Domain Methods and the Dynamic Grid
Digital musical instruments based on physical modelling have gained popularity in recent years, partly due to advances in computational power, which allow for their real-time implementation. One of the great potentials of digital musical instruments based on physical models is that one can go beyond what is physically possible and change properties of the instrument which are static in real life. This paper presents a real-time implementation of the dynamic stiff string using finite-difference time-domain (FDTD) methods. The defining parameters of the string can be varied in real time, changing the underlying grid that these methods rely on, based on the recently developed dynamic grid method. For most settings, parameter changes are nearly instantaneous and do not cause noticeable artefacts due to changes in the grid. A reliable way to prevent artefacts for all settings is under development.
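For orientation, the following NumPy sketch implements a conventional static-grid FDTD scheme for the stiff string; the paper's contribution is precisely to let the parameters (and hence the grid) vary at runtime, which this fixed-grid version does not do. All parameter values are illustrative.

```python
# Static-grid FDTD scheme for the ideal stiff string
# u_tt = c^2 u_xx - kappa^2 u_xxxx, explicit update.
import numpy as np

fs = 44100.0
k = 1.0 / fs                  # time step
c = 200.0                     # wave speed
kappa = 1.0                   # stiffness coefficient
L = 1.0                       # string length

# Stability condition gives the minimal grid spacing h
h = np.sqrt((c**2 * k**2 + np.sqrt(c**4 * k**4 + 16 * kappa**2 * k**2)) / 2)
N = int(L / h)                # flooring keeps h >= the stability bound
h = L / N
lam2 = (c * k / h) ** 2       # Courant number squared
mu2 = (kappa * k / h**2) ** 2

u = np.zeros(N + 1)
u[N // 2] = 1e-3              # crude initial displacement ("pluck")
u_prev = u.copy()

out = np.zeros(44100)
for n in range(len(out)):
    u_next = np.zeros_like(u)
    # interior update; points next to the ends stay at zero for simplicity
    u_next[2:-2] = (2 * u[2:-2] - u_prev[2:-2]
                    + lam2 * (u[3:-1] - 2 * u[2:-2] + u[1:-3])
                    - mu2 * (u[4:] - 4 * u[3:-1] + 6 * u[2:-2]
                             - 4 * u[1:-3] + u[:-4]))
    u_prev, u = u, u_next
    out[n] = u[N // 3]        # read displacement at an output point
```

Changing `c` or `kappa` here would require rebuilding the grid, which is exactly the discontinuity the dynamic grid method is designed to smooth over.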
Physical Modeling Using Recurrent Neural Networks with Fast Convolutional Layers
Discrete-time modeling of acoustic, mechanical, and electrical systems is a prominent topic in the musical signal processing literature. Such models are mostly derived by discretizing a mathematical model, given in terms of ordinary or partial differential equations, using established techniques. Recent work has applied machine-learning techniques to construct such models automatically from data for systems whose lumped states are described by scalar values, such as electrical circuits. In this work, we examine how similar techniques can construct models of systems which have spatially distributed rather than lumped states. We describe several novel recurrent neural network structures and show how they can be thought of as an extension of modal techniques. As a proof of concept, we generate synthetic data for three physical systems and show that the proposed network structures can be trained on this data to reproduce the behavior of these systems.
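The following PyTorch sketch illustrates the modal flavor of such a recurrence: a linear state-space layer whose diagonal complex state matrix acts as a bank of damped resonators. It illustrates the general idea only, not the authors' exact architecture.

```python
# A diagonal complex recurrence z <- a*z + b*x: each state is one damped
# "mode". A naive sequential loop is shown; such linear recurrences can
# also be evaluated in parallel with convolutional/FFT techniques.
import torch

class ModalRecurrence(torch.nn.Module):
    def __init__(self, n_modes: int):
        super().__init__()
        self.log_decay = torch.nn.Parameter(torch.full((n_modes,), -4.0))
        self.freq = torch.nn.Parameter(torch.rand(n_modes) * 3.0)
        self.b = torch.nn.Parameter(torch.randn(n_modes, dtype=torch.cfloat))
        self.c = torch.nn.Parameter(torch.randn(n_modes, dtype=torch.cfloat))

    def forward(self, x):                  # x: (time,) scalar excitation
        # per-mode pole: magnitude < 1 (decay), angle = modal frequency
        a = torch.exp(-torch.exp(self.log_decay) + 1j * self.freq)
        z = torch.zeros_like(self.b)
        y = []
        for x_t in x:
            z = a * z + self.b * x_t
            y.append((self.c * z).real.sum())
        return torch.stack(y)

model = ModalRecurrence(n_modes=32)
output = model(torch.randn(1000))          # excitation in, modal response out
```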
Differentiable Piano Model for MIDI-to-Audio Performance Synthesis
Recent neural synthesis models have achieved impressive results for musical instrument sound generation. In particular, the Differentiable Digital Signal Processing (DDSP) framework enables the use of spectral modeling analysis and synthesis techniques in fully differentiable architectures. Yet, to date, it has only been used for modeling monophonic instruments. Leveraging the interpretability and modularity of this framework, the present work introduces a polyphonic differentiable model for piano sound synthesis, conditioned on Musical Instrument Digital Interface (MIDI) inputs. The model architecture is motivated by high-level acoustic modeling knowledge of the instrument, which, together with the sound structure priors inherent to the DDSP components, makes for a lightweight, interpretable, and realistic-sounding piano model. The proposed model has been evaluated in a listening test, demonstrating improved sound quality compared to a benchmark neural piano model, with significantly fewer parameters and even with reduced training data. The same listening test indicates that physical-modeling-based approaches still achieve better quality, but the differentiability of our lightweight approach encourages its use in other musical tasks dealing with polyphonic audio and symbolic data.
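As a hint of what differentiable piano synthesis involves, this sketch renders one note with a differentiable additive synthesizer using the stiff-string inharmonicity law f_n = n·f0·sqrt(1 + B·n²). All parameter values are illustrative, and the paper's model is considerably richer.

```python
# One inharmonic piano-like note, differentiable in f0, partial amplitudes,
# and the inharmonicity coefficient B.
import torch

def piano_partials(f0, amps, B, n_samples, sr=16000):
    n = torch.arange(1, amps.shape[0] + 1, dtype=torch.float32)
    freqs = n * f0 * torch.sqrt(1 + B * n**2)          # inharmonic partial series
    t = torch.arange(n_samples, dtype=torch.float32) / sr
    phases = 2 * torch.pi * freqs[:, None] * t[None, :]
    decay = torch.exp(-3.0 * n[:, None] * t[None, :])  # higher partials decay faster
    return (amps[:, None] * decay * torch.sin(phases)).sum(dim=0)

f0 = torch.tensor(220.0, requires_grad=True)
amps = torch.rand(20, requires_grad=True)
B = torch.tensor(1e-4, requires_grad=True)             # inharmonicity coefficient
note = piano_partials(f0, amps, B, n_samples=16000)
note.abs().sum().backward()                            # gradients reach f0, amps, B
```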
Continuous State Modeling for Statistical Spectral Synthesis
Continuous-state Markovian spectral modeling is a novel approach to the parametric synthesis of spectral modeling parameters, based on the sines-plus-noise paradigm. The method aims specifically at capturing shimmer and jitter, the micro-fluctuations in the partials' frequency and amplitude trajectories that are essential to the timbre of musical instruments. It allows parametric control over the timbral qualities while removing the need for the more computationally expensive and restrictive discrete-state-space modeling method. A qualitative comparison between an original violin sound and a resynthesis shows the ability of the algorithm to reproduce the micro-fluctuations, considering their stochastic and spectral properties.
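The core idea can be sketched as follows: impose continuous-state first-order Markov (AR(1)) micro-fluctuations on a partial's frequency (jitter) and amplitude (shimmer) before additive resynthesis. This NumPy toy example is illustrative, not the paper's algorithm, and its depths and correlation are made-up values.

```python
# One partial with AR(1) jitter and shimmer; rho controls how smooth
# the micro-fluctuations are, sigma their depth.
import numpy as np

sr, dur = 44100, 1.0
n = int(sr * dur)
rho = 0.999                       # temporal correlation of the fluctuations
sigma_f, sigma_a = 2.0, 0.05      # jitter depth (Hz), shimmer depth (relative)

def ar1(n, rho, sigma, rng):
    x = np.zeros(n)
    w = rng.standard_normal(n) * sigma * np.sqrt(1 - rho**2)
    for i in range(1, n):
        x[i] = rho * x[i - 1] + w[i]   # continuous-state Markov update
    return x

rng = np.random.default_rng(0)
f_traj = 440.0 + ar1(n, rho, sigma_f, rng)        # jittered frequency (Hz)
a_traj = 0.5 * (1 + ar1(n, rho, sigma_a, rng))    # shimmered amplitude
phase = 2 * np.pi * np.cumsum(f_traj) / sr
partial = a_traj * np.sin(phase)                  # one resynthesized partial
```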
Analysis of Musical Dynamics in Vocal Performances Using Loudness Measures
Alongside tone, pitch, and rhythm, dynamics is an expressive dimension of music performance that has received limited attention. While the use of dynamics may vary from artist to artist, and from performance to performance, a systematic methodology to automatically identify the dynamics of a performance in musically meaningful terms such as forte and piano may offer valuable feedback in the context of music education, and in particular in singing. To this end, we have manually annotated the dynamic markings of commercial recordings of popular rock and pop songs from the Smule Vocal Balanced (SVB) dataset, which will be used as reference data. As a first step toward our research goal, we propose a method to derive and compare singing voice loudness curves in polyphonic mixtures. To measure the similarity and variation of dynamics, we compare the dynamics curves of the SVB renditions with those derived from the original songs. We perform the same comparison using professionally produced renditions from a karaoke website. We relate the high Spearman correlation coefficients found in select student renditions and in the professional renditions to accurate dynamics.
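The comparison step can be sketched as follows: compute frame-wise loudness curves in dB and correlate them with Spearman's rank coefficient. The signals and frame settings below are stand-ins; the paper additionally isolates the singing voice from the polyphonic mix before measuring loudness.

```python
# Frame-wise loudness curves and their Spearman rank correlation.
import numpy as np
from scipy.stats import spearmanr

def loudness_curve(y, frame=4096, hop=1024):
    frames = np.lib.stride_tricks.sliding_window_view(y, frame)[::hop]
    rms = np.sqrt((frames ** 2).mean(axis=1) + 1e-12)
    return 20 * np.log10(rms)              # simple RMS loudness in dB

rng = np.random.default_rng(0)
rendition = rng.standard_normal(44100 * 5)  # stand-ins for vocal tracks
original = rng.standard_normal(44100 * 5)

rho, p = spearmanr(loudness_curve(rendition), loudness_curve(original))
print(f"Spearman rho = {rho:.2f} (p = {p:.3g})")
```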
HD-AD: A New Approach to Audio Atomic Decomposition with Hyperdimensional Computing
In this paper, we approach the problem of atomic decomposition of audio at the symbolic level of atom parameters through the lens of hyperdimensional computing (HDC), a non-traditional computing paradigm. Existing atomic decomposition algorithms often operate on waveforms from a redundant dictionary of atoms, which makes them increasingly memory- and compute-intensive as the signal length grows and/or the atoms become more complicated. We systematically build an atom encoding using vector function architecture (VFA), a branch of HDC. We train a neural network encoder on synthetic audio signals to generate these encodings and observe that the network generalizes to real recordings. This system, which we call Hyperdimensional Atomic Decomposition (HD-AD), avoids time-domain correlations altogether. Because HD-AD scales with the sparsity of the signal rather than its length in time, atomic decompositions are often produced much faster than real time.
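A minimal sketch of VFA-style fractional power encoding, the building block such an atom encoding rests on: a continuous parameter x maps to an elementwise power z**x of a random phasor hypervector, and two parameters are bound by an elementwise product. This is illustrative, not the HD-AD system itself.

```python
# Fractional power encoding and binding of atom parameters in VFA style.
import numpy as np

D = 4096                                   # hypervector dimensionality
rng = np.random.default_rng(0)

# One random base phasor per parameter (e.g., atom onset and frequency)
theta_time = rng.uniform(-np.pi, np.pi, D)
theta_freq = rng.uniform(-np.pi, np.pi, D)

def encode(x, theta):
    return np.exp(1j * x * theta)          # fractional power: z**x

def atom_vector(t, f):
    return encode(t, theta_time) * encode(f, theta_freq)   # binding

# Bundle two atoms; an inner-product query recovers a bundled atom
bundle = atom_vector(0.2, 5.0) + atom_vector(0.7, 2.0)
query = atom_vector(0.2, 5.0)
similarity = np.real(np.vdot(query, bundle)) / D
print(f"match score ~= {similarity:.2f}")  # near 1 for a bundled atom
```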
A Low-Latency Quasi-Linear-Phase Octave Graphic Equalizer
This paper proposes a low-latency quasi-linear-phase octave graphic equalizer. The structure is derived from a recent linear-phase graphic equalizer based on interpolated finite impulse response (IFIR) filters. The proposed system reduces the total latency of the previous equalizer by means of a hybrid structure: an infinite impulse response (IIR) shelving filter implements the first band of the equalizer, whereas the remaining band filters are realized with the linear-phase FIR structure. The introduction of the IIR filter causes a nonlinear phase response at low frequencies, but the total latency is reduced by 50% in comparison to the linear-phase equalizer. The proposed graphic equalizer is useful in real-time audio processing, where only a small amount of latency can be tolerated.
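The hybrid idea can be sketched with SciPy: an IIR low-shelf biquad (RBJ cookbook formulas, used here as a stand-in for the paper's shelving design) handles the lowest band with no FIR latency, while a linear-phase FIR filter handles a higher octave band. Gains, band edges, and filter lengths are illustrative.

```python
# Hybrid low-shelf IIR + linear-phase FIR band, the two halves of the
# proposed structure.
import numpy as np
from scipy.signal import firwin, lfilter

fs = 48000

def rbj_low_shelf(f0, gain_db, Q=0.707):
    A = 10 ** (gain_db / 40)
    w0 = 2 * np.pi * f0 / fs
    alpha = np.sin(w0) / (2 * Q)
    cosw = np.cos(w0)
    b = A * np.array([(A + 1) - (A - 1) * cosw + 2 * np.sqrt(A) * alpha,
                      2 * ((A - 1) - (A + 1) * cosw),
                      (A + 1) - (A - 1) * cosw - 2 * np.sqrt(A) * alpha])
    a = np.array([(A + 1) + (A - 1) * cosw + 2 * np.sqrt(A) * alpha,
                  -2 * ((A - 1) + (A + 1) * cosw),
                  (A + 1) + (A - 1) * cosw - 2 * np.sqrt(A) * alpha])
    return b / a[0], a / a[0]

b_shelf, a_shelf = rbj_low_shelf(f0=31.5, gain_db=6.0)    # IIR first band
h_band = firwin(255, [707, 1414], pass_zero=False, fs=fs)  # linear-phase octave band

x = np.random.randn(fs)                                    # test input
y = lfilter(b_shelf, a_shelf, x) + lfilter(h_band, [1.0], x)
# A real design delay-compensates the shelf path to align with the FIR
# bands' group delay ((255 - 1) / 2 samples here).
```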