Paper Archive

Deep neural networks have recently been applied to the task of automatic synthesizer programming, i.e., finding optimal values of sound synthesis parameters in order to reproduce a given input sound. This paper focuses on generative models, which can infer parameters as well as generate new sets of parameters or perform smooth morphing effects between sounds. We introduce new models to ensure scalability and to increase performance by using heterogeneous representations of parameters as numerical and categorical random variables. Moreover, a spectral variational autoencoder architecture with multi-channel input is proposed in order to improve inference of parameters related to the pitch and intensity of input sounds. Model performance was evaluated according to several criteria such as parameter estimation error and audio reconstruction accuracy. Training and evaluation were performed using a dataset of 30k presets, which is published with this paper. The results demonstrate significant improvements in terms of parameter inference and audio accuracy and show that the presented models can be used with subsets or full sets of synthesizer parameters.

Exposure Bias and State Matching in Recurrent Neural Network Virtual Analog Models

Aleksi Peussa; Eero-Pekka Damskägg; Thomas Sherson; Stylianos I. Mimilakis; Lauri Juvela; Athanasios Gotsopoulos; Vesa Välimäki

Virtual analog (VA) modeling using neural networks (NNs) has great potential for rapidly producing high-fidelity models. Recurrent neural networks (RNNs) are especially appealing for VA due to their connection with discrete nodal analysis. Furthermore, VA models based on NNs can be trained efficiently by directly exposing them to the circuit states in a gray-box fashion. However, exposure to ground truth information during training can leave the models susceptible to error accumulation in a free-running mode, also known as “exposure bias” in machine learning literature. This paper presents a unified framework for treating the previously proposed state trajectory network (STN) and gated recurrent unit (GRU) networks as special cases of discrete nodal analysis. We propose a novel circuit state-matching mechanism for the GRU and experimentally compare the previously mentioned networks for their performance in state matching, during training, and in exposure bias, during inference. Experimental results from modeling a diode clipper show that all the tested models exhibit some exposure bias, which can be mitigated by truncated backpropagation through time. Furthermore, the proposed state matching mechanism improves the GRU modeling performance of an overdrive pedal and a phaser pedal, especially in the presence of external modulation, apparent in a phaser circuit.
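
The difference between teacher-forced and free-running operation can be illustrated with a toy example. The following is a minimal sketch, not the networks from the paper: `true_step` is a hypothetical first-order recursion standing in for a circuit state update, and `model_step` is a hypothetical imperfect model with a slightly wrong coefficient. With teacher forcing the model always starts from the ground-truth state, so its error stays at the one-step level; in free-running mode it feeds back its own predictions, so the error can accumulate.

```python
def true_step(state, x, a=0.9):
    """Toy ground-truth system (assumption): a first-order state recursion."""
    return a * state + (1.0 - a) * x


def model_step(state, x, a_hat=0.93):
    """Hypothetical imperfect model: same form, slightly wrong coefficient."""
    return a_hat * state + (1.0 - a_hat) * x


def run(inputs, teacher_forced):
    """Return the per-sample absolute state error of the model."""
    true_state, model_state = 0.0, 0.0
    errors = []
    for x in inputs:
        # Teacher forcing: predict from the ground-truth previous state.
        # Free running: predict from the model's own previous output.
        prev = true_state if teacher_forced else model_state
        model_state = model_step(prev, x)
        true_state = true_step(true_state, x)
        errors.append(abs(model_state - true_state))
    return errors


step_input = [1.0] * 50
forced_errors = run(step_input, teacher_forced=True)
free_errors = run(step_input, teacher_forced=False)
print(max(forced_errors), max(free_errors))
```

In this toy setting the largest free-running error exceeds the largest teacher-forced error, which is the behavior the term "exposure bias" refers to.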

Transition-Aware: A More Robust Approach for Piano Transcription

Xianke Wang; Wei Xu; Juanting Liu; Weiming Yang; Wenqing Cheng

Piano transcription is a classic problem in music information retrieval, and a growing number of deep-learning-based transcription methods have been proposed in recent years. In 2019, Google Brain published a larger piano transcription dataset, MAESTRO. On this dataset, the Onsets and Frames transcription approach proposed by Hawthorne achieved a stunning onset F1 score of 94.73%. Unlike the annotation method of Onsets and Frames, the Transition-aware model presented in this paper annotates the attack process of piano signals, called the attack transition, across multiple frames instead of marking only the onset frame. In this way, the piano signal around the onset time is taken into account, making piano onset detection more stable and robust. Transition-aware achieves a higher transcription F1 score than Onsets and Frames on the MAESTRO and MAPS datasets, reducing many extra-note detection errors. This indicates that the Transition-aware approach generalizes better across datasets.
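
The contrast between single-frame onset labels and multi-frame transition labels can be sketched as follows. This is an illustrative assumption about the label layout, not the paper's exact annotation scheme: the function names, the `frame_rate` argument, and the `transition_frames` length are all hypothetical.

```python
def onset_labels(onset_times, n_frames, frame_rate):
    """Mark only the single frame that contains each onset."""
    labels = [0] * n_frames
    for t in onset_times:
        idx = int(round(t * frame_rate))
        if 0 <= idx < n_frames:
            labels[idx] = 1
    return labels


def transition_labels(onset_times, n_frames, frame_rate, transition_frames=3):
    """Mark a run of frames starting at each onset (assumed attack length)."""
    labels = [0] * n_frames
    for t in onset_times:
        start = int(round(t * frame_rate))
        for idx in range(max(start, 0), min(start + transition_frames, n_frames)):
            labels[idx] = 1
    return labels


# One note starting at 0.1 s, with 50 frames per second (hypothetical values).
single = onset_labels([0.1], n_frames=10, frame_rate=50.0)
multi = transition_labels([0.1], n_frames=10, frame_rate=50.0)
print(single)
print(multi)
```

With multi-frame targets, a prediction that lands one frame late still overlaps the labeled region, which is one plausible reading of why such labels make onset detection more stable.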

Quality Diversity for Synthesizer Sound Matching

Naotake Masuda; Daisuke Saito

It is difficult to adjust the parameters of a complex synthesizer to create the desired sound. As such, sound matching, the estimation of synthesis parameters that can replicate a certain sound, is a task that has often been researched, utilizing optimization methods such as genetic algorithms (GAs). In this paper, we introduce a novelty-based objective for GA-based sound matching. Our contribution is two-fold. First, we show that the novelty objective is able to improve the quality of sound matching by maintaining phenotypic diversity in the population. Second, we introduce a quality diversity approach to the problem of sound matching, aiming to find a diverse set of matching sounds. We show that the novelty objective is effective in producing high-performing solutions that are diverse in terms of specified audio features. This approach allows for a new way of discovering sounds and exploring the capabilities of a synthesizer.
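
A novelty objective is commonly computed as the mean distance from an individual to its nearest neighbors in a feature space. The sketch below uses that common definition as an assumption; the abstract does not state the exact formulation, and the two-dimensional feature vectors and the `novelty_scores` helper are hypothetical.

```python
import math


def novelty_scores(features, k=2):
    """Mean Euclidean distance to the k nearest other individuals.

    Expects at least two feature vectors of equal length.
    """
    scores = []
    for i, f in enumerate(features):
        dists = sorted(math.dist(f, g) for j, g in enumerate(features) if j != i)
        nearest = dists[:k]
        scores.append(sum(nearest) / len(nearest))
    return scores


# Hypothetical audio-feature vectors for four candidate sounds.
population = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0)]
scores = novelty_scores(population, k=2)
most_novel = max(range(len(scores)), key=scores.__getitem__)
print(scores, most_novel)
```

Individuals far from the rest of the population receive a higher score, so selecting on this value alongside matching quality pushes the population to stay spread out in feature space.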

An Audio-Visual Fusion Piano Transcription Approach Based on Strategy

Xianke Wang; Wei Xu; Juanting Liu; Weiming Yang; Wenqing Cheng

Piano transcription is a fundamental problem in the field of music information retrieval. At present, most transcription studies are based on either audio or video, and few discuss audio-visual fusion. In this paper, a piano transcription model based on strategy fusion is proposed, in which the transcription results of the video model are used to assist audio transcription. Because datasets suitable for audio-visual fusion are lacking, the OMAPS dataset is proposed in this paper. Our strategy fusion model achieves a 92.07% F1 score on the OMAPS dataset. The transcription model based on feature fusion is also compared with the one based on strategy fusion. The experimental results show that the transcription model based on strategy fusion achieves better results than the one based on feature fusion.

Model Bending: Teaching Circuit Models New Tricks

Daniel J. Gillespie; Samuel Schachter

A technique is introduced for generating novel signal processing systems grounded in analog electronic circuits, called model bending. By applying the ideas behind circuit bending to models of nonlinear analog circuits it is possible to create novel nonlinear signal processors which mimic the behavior of analog electronics, but which are not possible to implement in the analog realm. The history of both circuit bending and circuit modeling is discussed, as well as a theoretical basis for how these approaches can complement each other. Potential pitfalls to the practical application of model bending are highlighted and suggested solutions to those problems are provided, with examples.

Virtual Analog Modeling of Distortion Circuits Using Neural Ordinary Differential Equations

Jan Wilczek; Alec Wright; Emanuël A. P. Habets; Vesa Välimäki

Recent research in deep learning has shown that neural networks can learn differential equations governing dynamical systems. In this paper, we adapt this concept to Virtual Analog (VA) modeling to learn the ordinary differential equations (ODEs) governing the first-order and the second-order diode clipper. The proposed models achieve performance comparable to state-of-the-art recurrent neural networks (RNNs) while using fewer parameters. We show that this approach does not require oversampling and allows the sampling rate to be increased after training has completed, which results in increased accuracy. Using a more sophisticated numerical solver increases the accuracy at the cost of slower processing. ODEs learned this way do not require closed forms but are still physically interpretable.
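
The sampling rate can be changed after training because a derivative function does not depend on the integration step size. The following is a minimal sketch of that idea under stated assumptions: `derivative` is a hand-written linear first-order system standing in for a trained network (it is not a diode clipper model), and `euler_solve` is a plain forward Euler integrator rather than any solver used in the paper.

```python
import math

TAU = 1e-3  # time constant of the stand-in system, in seconds (assumption)


def derivative(y, x):
    """Stand-in for a learned derivative network: dy/dt = (x - y) / TAU."""
    return (x - y) / TAU


def euler_solve(f, y0, x, duration, sample_rate):
    """Integrate dy/dt = f(y, x) with forward Euler for a constant input x."""
    h = 1.0 / sample_rate
    y = y0
    for _ in range(int(round(duration * sample_rate))):
        y += h * f(y, x)
    return y


duration = 2e-3
exact = 1.0 - math.exp(-duration / TAU)  # analytic unit-step response

# Same derivative function, two different sampling rates.
err_48k = abs(euler_solve(derivative, 0.0, 1.0, duration, 48000) - exact)
err_192k = abs(euler_solve(derivative, 0.0, 1.0, duration, 192000) - exact)
print(err_48k, err_192k)
```

Running the same derivative function at the higher rate gives a smaller error here, without changing the function itself. Replacing forward Euler with a higher-order scheme would trade extra computation per step for accuracy in a similar way.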

A Virtual Analog Model of the EDP Wasp VCF

Lasse Köper; Martin Holters; Fabián Esqueda; Julian D. Parker

In this paper, we present a virtual analog model of the voltage-controlled filter used in the EDP Wasp synthesizer. This circuit is an interesting case study for virtual analog modeling due to its characteristic nonlinear and highly dynamic behavior which can be attributed to its unusual design. The Wasp filter consists of a state variable filter topology implemented using operational transconductance amplifiers (OTAs) as the cutoff-control elements and CMOS inverters in lieu of operational amplifiers, all powered by a unipolar power supply. In order to accurately model the behavior of the circuit we propose extended models for its nonlinear components, focusing particularly on the OTAs. The proposed component models are used inside a white-box circuit modeling framework to create a digital simulation of the filter which retains the interesting characteristics of the original device.

Modeling and Extending the RCA Mark II Sound Effects Filter

Kurt James Werner; Ezra J. Teboul; Seth Cluett; Emma Azelborn

We have analyzed the Sound Effects Filter from the one-of-a-kind RCA Mark II sound synthesizer and modeled it as a Wave Digital Filter using the Faust language, to make this once exclusive device widely available. By studying the original schematics and measurements of the device, we discovered several circuit modifications. Building on these, we proposed a number of extensions to the circuit which increase its usefulness in music production.

Analysis of Musical Dynamics in Vocal Performances Using Loudness Measures

Jyoti Narang; Marius Miron; Ajay Srinivasamurthy; Xavier Serra

In addition to tone, pitch and rhythm, dynamics is one of the expressive dimensions of the performance of a music piece that has received limited attention. While the usage of dynamics may vary from artist to artist, and also from performance to performance, a systematic methodology to automatically identify the dynamics of a performance in musically meaningful terms such as forte and piano may offer valuable feedback in the context of music education and in particular in singing. To this end, we have manually annotated the dynamic markings of commercial recordings of popular rock and pop songs from the Smule Vocal Balanced (SVB) dataset, which will be used as reference data. Then, as a first step towards our research goal, we propose a method to derive and compare singing voice loudness curves in polyphonic mixtures. To measure the similarity and variation of dynamics, we compare the dynamics curves of the SVB renditions with those derived from the original songs. We perform the same comparison using professionally produced renditions from a karaoke website. We relate the high values of the Spearman correlation coefficient found in some selected student renditions and in the professional renditions to accurate dynamics.
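
The Spearman correlation coefficient compares the rank order of two curves rather than their absolute values, so it is insensitive to a constant level offset between recordings. The sketch below computes it from scratch; the loudness values are hypothetical, and the abstract does not specify how the loudness curves themselves are obtained.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(a, b):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def spearman(a, b):
    """Spearman correlation: Pearson correlation of the rank sequences."""
    return pearson(ranks(a), ranks(b))


# Hypothetical per-segment loudness values (in dB) for two performances.
reference = [-30.0, -24.0, -18.0, -12.0, -20.0]
rendition = [-35.0, -30.0, -22.0, -15.0, -25.0]
rho = spearman(reference, rendition)
print(rho)
```

In this example the rendition is quieter overall but rises and falls in the same order as the reference, so the two curves are perfectly rank-correlated.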