Download Notes on the use of Variational Autoencoders for Speech and Audio Spectrogram Modeling Variational autoencoders (VAEs) are powerful (deep) generative artificial neural networks. They have been recently used in several papers for speech and audio processing, in particular for the modeling of speech/audio spectrograms. In these papers, very poor theoretical support is given to justify the chosen data representation and decoder likelihood function or the corresponding cost function used for training the VAE. Yet, a nice theoretical statistical framework exists and has been extensively presented and discussed in papers dealing with nonnegative matrix factorization (NMF) of audio spectrograms and its application to audio source separation. In the present paper, we show how this statistical framework applies to VAE-based speech/audio spectrogram modeling. This provides the latter insights on the choice and interpretability of data representation and model parameterization.
Download Modelling Experts’ Decisions on Assigning Narrative Importances of Objects in a Radio Drama Mix There is an increasing number of consumers of broadcast audio who suffer from a degree of hearing impairment. One of the methods developed for tackling this issue consists of creating customizable object-based audio mixes where users can attenuate parts of the mix using a simple complexity parameter. The method relies on the mixing engineer classifying audio objects in the mix according to their narrative importance. This paper focuses on automating this process. Individual tracks are classified based on their music, speech, or sound effect content. Then the decisions for assigning narrative importance to each segment of a radio drama mix are modelled using mixture distributions. Finally, the learned decisions and resultant mixes are evaluated using the Short Term Objective Intelligibility, with reference to the narrative importance selections made by the original producer. This approach has applications for providing customizable mixes for legacy content, or automatically generated media content where the engineer is not able to intervene.
Download The Shape of RemiXXXes to Come: Audio Texture Synthesis with Time-frequency Scattering This article explains how to apply time–frequency scattering, a convolutional operator extracting modulations in the time–frequency domain at different rates and scales, to the re-synthesis and manipulation of audio textures. After implementing phase retrieval in the scattering network by gradient backpropagation, we introduce scale-rate DAFx, a class of audio transformations expressed in the domain of time–frequency scattering coefficients. One example of scale-rate DAFx is chirp rate inversion, which causes each sonic event to be locally reversed in time while leaving the arrow of time globally unchanged. Over the past two years, our work has led to the creation of four electroacoustic pieces: FAVN; Modulator (Scattering Transform); Experimental Palimpsest; Inspection (Maida Vale Project) and Inspection II; as well as XAllegroX (Hecker Scattering.m Sequence), a remix of Lorenzo Senni’s XAllegroX, released by Warp Records on a vinyl entitled The Shape of RemiXXXes to Come.
Download Statistical Sinusoidal Modeling for Expressive Sound Synthesis Statistical sinusoidal modeling represents a method for transferring a sample library of instrument sounds into a data base of sinusoidal parameters for the use in real time additive synthesis. Single sounds, capturing an instrument in combinations of pitch and intensity, are therefor segmented into attack, sustain and release. Partial amplitudes, frequencies and Bark band energies are calculated for all sounds and segments. For the sustain part, all partial and noise parameters are transformed to probabilistic distributions. Interpolated inverse transform sampling is introduced for generating parameter trajectories during synthesis in real time, allowing the creation of sounds located at pitches and intensities between the actual support points of the sample library. Evaluation is performed by qualitative analysis of the system response to sweeps of the control parameters pitch and intensity. Results for a set of violin samples demonstrate the ability of the approach to model dynamic timbre changes, which is crucial for the perceived quality of expressive sound synthesis.
Download Towards Inverse Virtual Analog Modeling Several digital signal processing approaches, generally referred to as Virtual Analog (VA) modeling, are currently under development for the software emulation of analog audio circuitry. The main purpose of VA modeling is to faithfully reproduce the behavior of real-world audio gear, e.g., distortion effects, synthesizers or amplifiers, using efficient algorithms. In this paper, however, we provide a preliminary discussion about how VA modeling can be exploited to infer the input signal of an analog audio system, given the output signal and the parameters of the circuit. In particular, we show how an inversion theorem known in circuit theory, and based on nullors, can be used for this purpose. As recent advances in Wave Digital Filter (WDF) theory allow us to implement circuits with nullors in a systematic fashion, WDFs prove to be useful tools for inverse VA modeling. WDF realizations of a nonlinear audio system and its inverse are presented as an example of application.
Download Time Scale Modification of Audio Using Non-Negative Matrix Factorization This paper introduces an algorithm for time-scale modification of audio signals based on using non-negative matrix factorization. The activation signals attributed to the detected components are used for identifying sound events. The segmentation of these events is used for detecting and preserving transients. In addition, the algorithm introduces the possibility of preserving the envelopes of overlapping sound events while globally modifying the duration of an audio clip.
Download Audio Transport: A Generalized Portamento via Optimal Transport This paper proposes a new method to interpolate between two audio signals. As an interpolation parameter is changed, the pitches in one signal slide to the pitches in the other, producing a portamento, or musical glide. The assignment of pitches in one sound to pitches in the other is accomplished by solving a 1-dimensional optimal transport problem. In addition, we introduce several techniques that preserve the audio fidelity over this highly nonlinear transformation. A portamento is a natural way for a musician to transition between notes, but traditionally it has only been possible for instruments with a continuously variable pitch like the human voice or the violin. Audio transport extends the portamento to any instrument, even polyphonic ones. Moreover, the effect can be used to transition between different instruments, groups of instruments, or any other pair of audio signals. The audio transport effect operates in real-time; we provide an open-source implementation. In experiments with sinusoidal inputs, the interpolating effect is indistinguishable from ideal sine sweeps. More generally, the effect produces clear, musical results for a wide variety of inputs.
Download Analysis and Correction of Maps Dataset Automatic music transcription (AMT) is the process of converting the original music signal into the digital music symbol. The MIDI Aligned Piano Sounds (MAPS) dataset was established in 2010 and is the most used benchmark dataset for automatic piano music transcription. In this paper, error screening is carried out through algorithm strategy, and three data annotation problems are found in ENSTDkCl, which is a subset of MAPS, usually used for algorithm evaluation: (1) there are 342 deviation errors of midi annotation; (2) there are 803 unplayed note errors; (3) there are 1613 slow starting process errors. After algorithm correction and manual confirmation, the corrected dataset is released. Finally, the better-performing Google model and our model are evaluated on the corrected dataset. The F values are 85.94% and 85.82%, respectively, and it is correspondingly improved compared with the original dataset, which proves that the correction of the dataset is meaningful.
Download Universal Audio Synthesizer Control with Normalizing Flows The ubiquity of sound synthesizers have reshaped music production and even entirely define new music genres. However, the increasing complexity and number of parameters in modern synthesizers make them harder to master. Hence, the development of methods allowing to easily create and explore with synthesizers is a crucial need. Here, we introduce a radically novel formulation of audio synthesizer control by formalizing it as finding an organized continuous latent space of audio that represents the capabilities of a synthesizer and map this space to the space of synthesis parameter. By using this formulation, we show that we can address simultaneously automatic parameter inference, macro-control learning and audio-based preset exploration within a single model. To solve this new formulation, we rely on Variational Auto-Encoders (VAE) and Normalizing Flows (NF) to organize and map the respective auditory and parameter spaces. We introduce a new type of NF named regression flows that allow to perform an invertible mapping between separate latent spaces, while steering the organization of some of the latent dimensions. We evaluate our proposal against a large set of baseline models and show its superiority in both parameter inference and audio reconstruction. We also show that the model disentangles the major factors of audio variations as latent dimensions, that can be directly used as macro-parameters. Finally, we discuss the use of our model in several creative applications and introduce real-time implementations in Ableton Live
Download NMF Toolbox: Music Processing Applications of Nonnegative Matrix Factorization Nonnegative matrix factorization (NMF) is a family of methods widely used for information retrieval across domains including text, images, and audio. Within music processing, NMF has been used for tasks such as transcription, source separation, and structure analysis. Prior work has shown that initialization and constrained update rules can drastically improve the chances of NMF converging to a musically meaningful solution. Along these lines we present the NMF toolbox, containing MATLAB and Python implementations of conceptually distinct NMF variants—in particular, this paper gives an overview for two algorithms. The first variant, called nonnegative matrix factor deconvolution (NMFD), extends the original NMF algorithm to the convolutive case, enforcing the temporal order of spectral templates. The second variant, called diagonal NMF, supports the development of sparse diagonal structures in the activation matrix. Our toolbox contains several demo applications and code examples to illustrate its potential and functionality. By providing MATLAB and Python code on a documentation website under a GNU-GPL license, as well as including illustrative examples, our aim is to foster research and education in the field of music processing.