Kronos VST – The Programmable Effect Plugin
This paper introduces Kronos VST, an audio effect plugin conforming to the VST 3 standard that can be programmed on the fly by the user, allowing entire signal processors to be defined in real time. A brief survey of existing programmable plugins and development aids for audio effect plugins is given. Kronos VST includes a functional just-in-time compiler that produces high-performance native machine code from high-level source code. The features of the Kronos programming language are briefly covered, followed by the special considerations of integrating user programs into the VST infrastructure. Finally, introductory example programs are provided.
Automatic Violin Synthesis Using Expressive Musical Term Features
The control of interpretational properties such as duration, vibrato, and dynamics is important in music performance. Musicians continuously manipulate such properties to achieve different expressive intentions. This paper presents a synthesis system that automatically converts a mechanical, deadpan interpretation into distinct expressions by controlling these expressive factors. Extending prior work on expressive musical term (EMT) analysis, we derive a subset of essential features as the control parameters, such as the relative time position of the energy peak in a note and the mean temporal length of the notes. An algorithm is proposed to manipulate the energy contour (i.e., dynamics) of a note. The intended expressions of the synthesized sounds are evaluated by testing whether the machine model developed in the prior work can recognize them. Ten musical expressions such as Risoluto and Maestoso are considered, and the evaluation is done using held-out music pieces. Our evaluations show that it is easier for the machine to recognize the expressions of the synthetic version compared to those of real recordings by an amateur student. While a listening test is planned as a next step for further performance validation, this work represents, to the best of our knowledge, the first attempt to build and quantitatively evaluate a system for EMT analysis/synthesis.
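One of the control features named above is the relative time position of a note's energy peak. As a rough illustration of how such a feature could be imposed on a deadpan note, the sketch below warps a note's short-time energy envelope so that its peak lands at a requested relative position. The function name, frame size, and warping scheme are assumptions for illustration; this is not the algorithm proposed in the paper.

```python
import numpy as np

def shift_energy_peak(note, target_peak_pos, frame=256):
    """Move the energy peak of a single note to a target relative
    position in (0, 1) by warping its amplitude envelope.
    Illustrative only; not the paper's proposed algorithm."""
    n_frames = max(2, len(note) // frame)          # assumes len(note) >= 2 * frame
    # Short-time RMS envelope of the note.
    env = np.array([np.sqrt(np.mean(note[i * frame:(i + 1) * frame] ** 2))
                    for i in range(n_frames)])
    env = np.maximum(env, 1e-8)

    # Piecewise-linear time warp mapping the current peak position to the target.
    cur_peak = np.clip(np.argmax(env) / (n_frames - 1), 1e-3, 1 - 1e-3)
    grid = np.linspace(0.0, 1.0, n_frames)
    src_pos = np.interp(grid, [0.0, target_peak_pos, 1.0], [0.0, cur_peak, 1.0])
    warped_env = np.interp(src_pos, grid, env)

    # Re-impose the warped envelope on the note, frame by frame.
    out = note.astype(float).copy()
    for i in range(n_frames):
        out[i * frame:(i + 1) * frame] *= warped_env[i] / env[i]
    return out
```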
Training Neural Models of Nonlinear Multi-Port Elements Within Wave Digital Structures Through Discrete-Time Simulation
Neural networks have been applied within the Wave Digital Filter (WDF) framework as data-driven models for nonlinear multi-port circuit elements. Conventionally, these models are trained on wave variables obtained by sampling the current-voltage characteristic of the considered nonlinear element before being incorporated into the circuit WDF implementation. However, isolating multi-port elements for this process can be challenging, as their nonlinear behavior often depends on dynamic effects that emerge from interactions with the surrounding circuit. In this paper, we propose a novel approach for training neural models of nonlinear multi-port elements directly within a circuit’s Wave Digital (WD) discrete-time implementation, relying solely on circuit input-output voltage measurements. Exploiting the differentiability of WD simulations, we embed the neural network into the simulation process and optimize its parameters using gradient-based methods by minimizing a loss function defined over the circuit output voltage. Experimental results demonstrate the effectiveness of the proposed approach in accurately capturing the nonlinear circuit behavior, while preserving the interpretability and modularity of WDFs.
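To make the training idea concrete, here is a heavily simplified PyTorch sketch, not code prescribed by the paper: a small network stands in for the wave-domain scattering relation of a nonlinear one-port, it is embedded in a minimal Wave Digital structure (an adapted resistive voltage source connected directly to the neural port), and its weights are fitted by backpropagating an output-voltage loss through the simulation. The network size, port configuration, and placeholder signals are assumptions.

```python
import torch
import torch.nn as nn

class WaveNonlinearity(nn.Module):
    """Small MLP standing in for the wave-domain scattering relation
    b = f(a) of a nonlinear one-port (illustrative stand-in only)."""
    def __init__(self, hidden=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, hidden), nn.Tanh(),
                                 nn.Linear(hidden, hidden), nn.Tanh(),
                                 nn.Linear(hidden, 1))

    def forward(self, a):
        return self.net(a)

def wd_simulate(model, vs):
    """Differentiable WD loop: an adapted resistive voltage source drives
    the neural one-port; returns the port voltage sample by sample."""
    v_out = []
    for n in range(vs.shape[0]):
        b_src = vs[n].reshape(1, 1)        # adapted source: reflected wave equals Vs
        a_nl = b_src                       # direct connection of the two ports
        b_nl = model(a_nl)                 # neural scattering at the nonlinearity
        v_out.append(0.5 * (a_nl + b_nl))  # voltage from incident/reflected waves
    return torch.cat(v_out).squeeze()

# Gradient-based fitting on input/output voltage pairs (placeholders here).
vs_meas = torch.randn(512)                 # measured input voltage (placeholder)
vout_meas = torch.tanh(vs_meas)            # measured output voltage (placeholder)
model = WaveNonlinearity()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = torch.mean((wd_simulate(model, vs_meas) - vout_meas) ** 2)
    loss.backward()
    opt.step()
```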
A Statistics-Driven Differentiable Approach for Sound Texture Synthesis and Analysis
In this work, we introduce TexStat, a novel loss function specifically designed for the analysis and synthesis of texture sounds characterized by stochastic structure and perceptual stationarity. Drawing inspiration from the statistical and perceptual framework of McDermott and Simoncelli, TexStat identifies similarities between signals belonging to the same texture category without relying on temporal structure. We also propose using TexStat as a validation metric alongside the Fréchet Audio Distance (FAD) to evaluate texture sound synthesis models. In addition to TexStat, we present TexEnv, an efficient, lightweight, and differentiable texture sound synthesizer that generates audio by imposing amplitude envelopes on filtered noise. We further integrate these components into TexDSP, a DDSP-inspired generative model tailored for texture sounds. Through extensive experiments across various texture sound types, we demonstrate that TexStat is perceptually meaningful, time-invariant, and robust to noise, features that make it effective both as a loss function for generative tasks and as a validation metric. All tools and code are provided as open-source contributions, and our PyTorch implementations are efficient, differentiable, and highly configurable, enabling their use both in generative tasks and as a perceptually grounded evaluation metric.
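As a rough indication of the kind of statistics such a loss can build on, the sketch below (not the released TexStat code; the band count, STFT settings, and chosen moments are assumptions) pools an STFT magnitude into coarse subband envelopes and compares their moments and cross-band correlations between two signals, in the spirit of the McDermott and Simoncelli statistics.

```python
import torch

def envelope_stats(x, n_bands=16, win=1024, hop=256):
    """Summarize a signal by coarse subband-envelope statistics:
    per-band mean, standard deviation, skewness, and cross-band
    correlations. Illustrative stand-in, not TexStat itself."""
    spec = torch.stft(x, n_fft=win, hop_length=hop,
                      window=torch.hann_window(win), return_complex=True).abs()
    bins = spec.shape[0]
    # Pool frequency bins into n_bands coarse subband envelopes.
    env = spec[: bins - bins % n_bands].reshape(n_bands, -1, spec.shape[1]).mean(1)
    mean = env.mean(dim=1)
    std = env.std(dim=1)
    skew = ((env - mean[:, None]) ** 3).mean(dim=1) / (std ** 3 + 1e-8)
    corr = torch.corrcoef(env)                       # cross-band envelope correlations
    return torch.cat([mean, std, skew, corr.flatten()])

def texture_loss(x, y):
    """Distance between the summary statistics of two texture signals."""
    return torch.mean((envelope_stats(x) - envelope_stats(y)) ** 2)
```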
Differentiable White-Box Virtual Analog Modeling
Component-wise circuit modeling, also known as “white-box” modeling, is a well-established and much-discussed technique in virtual analog modeling. This approach is generally limited in accuracy by a lack of access to the exact component values present in a real example of the circuit. In this paper we show how this problem can be addressed by implementing the white-box model in a differentiable form, allowing approximate component values to be learned from raw input–output audio measured from a real device.
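A toy example of what a differentiable white-box model looks like in practice: below, a first-order RC low-pass is discretized with the bilinear transform inside a PyTorch module, with R and C exposed as learnable parameters so they can be fitted to measured input/output audio with a standard optimizer. This is a minimal sketch under assumed element values and is far simpler than the circuits treated in the paper.

```python
import torch

class DifferentiableRC(torch.nn.Module):
    """Toy white-box model: a one-pole RC low-pass discretized with the
    bilinear transform; R and C are learnable (log-parameterized so they
    stay positive). Illustrative only."""
    def __init__(self, fs=48000.0, R0=1e4, C0=1e-8):
        super().__init__()
        self.fs = fs
        self.log_R = torch.nn.Parameter(torch.log(torch.tensor(R0)))
        self.log_C = torch.nn.Parameter(torch.log(torch.tensor(C0)))

    def forward(self, x):
        R, C = torch.exp(self.log_R), torch.exp(self.log_C)
        k = 2.0 * self.fs * R * C          # bilinear transform of 1 / (1 + sRC)
        b0 = 1.0 / (1.0 + k)
        a1 = (1.0 - k) / (1.0 + k)
        ys, x_prev, y_prev = [], x.new_zeros(()), x.new_zeros(())
        for n in range(x.shape[0]):
            y_n = b0 * (x[n] + x_prev) - a1 * y_prev
            ys.append(y_n)
            x_prev, y_prev = x[n], y_n
        return torch.stack(ys)

# Fitting the component values to measured audio (placeholders shown).
model = DifferentiableRC()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
x_meas, y_meas = torch.randn(1024), torch.randn(1024)   # replace with device recordings
for _ in range(100):
    opt.zero_grad()
    loss = torch.mean((model(x_meas) - y_meas) ** 2)
    loss.backward()
    opt.step()
```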
Drum Translation for Timbral and Rhythmic Transformation
Many recent approaches to creative transformations of musical audio have been motivated by the success of raw audio generation models such as WaveNet, in which audio samples are modeled by generative neural networks. This paper describes a generative audio synthesis model for multi-drum translation based on a WaveNet denoising autoencoder architecture. The timbre of an arbitrary source audio input is transformed to sound as if it were played by various percussive instruments while preserving its rhythmic structure. Two evaluations of the transformations are conducted, based on the capacity of the model to preserve the rhythmic patterns of the input and on the audio quality as it relates to the timbre of the target drum domain. The first evaluation measures the rhythmic similarities between the source audio and the corresponding drum translations, and the second provides a numerical analysis of the quality of the synthesised audio. Additionally, semi- and fully-automatic audio effect workflows are proposed, in which the user may assist the system by manually labelling source audio segments or may rely on a state-of-the-art automatic drum transcription system prior to drum translation.
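One way to make the rhythm-preservation evaluation concrete: the snippet below is an illustrative stand-in, not the paper's metric; the librosa calls, sample rate, and normalization are assumptions. It correlates the onset-strength envelopes of a source recording and its drum translation, so a value near 1 suggests the translated audio keeps the input's rhythmic pattern.

```python
import numpy as np
import librosa

def rhythmic_similarity(source_path, translated_path, sr=22050):
    """Correlate onset-strength envelopes of the source audio and its
    drum translation (illustrative rhythm-preservation check)."""
    x, _ = librosa.load(source_path, sr=sr)
    y, _ = librosa.load(translated_path, sr=sr)
    n = min(len(x), len(y))
    env_x = librosa.onset.onset_strength(y=x[:n], sr=sr)
    env_y = librosa.onset.onset_strength(y=y[:n], sr=sr)
    env_x = (env_x - env_x.mean()) / (env_x.std() + 1e-8)
    env_y = (env_y - env_y.mean()) / (env_y.std() + 1e-8)
    return float(np.mean(env_x * env_y))   # Pearson-style correlation
```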
Removing Lavalier Microphone Rustle With Recurrent Neural Networks
The noise that lavalier microphones produce when rubbing against clothing (typically referred to as rustle) can be extremely difficult to remove automatically because it is highly non-stationary and overlaps with speech in both time and frequency. Recent breakthroughs in deep neural networks have led to novel techniques for separating speech from non-stationary background noise. In this paper, we apply neural network speech separation techniques to remove rustle noise, and quantitatively compare multiple deep network architectures and input spectral resolutions. We find the best performance using bidirectional recurrent networks and a spectral resolution of around 20 Hz. Furthermore, we propose an ambience preservation post-processing step to minimize potential gating artifacts during pauses in speech.
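For readers wanting a starting point, the following PyTorch sketch shows the general shape of a bidirectional recurrent mask estimator of this kind; the layer sizes, number of frequency bins, and two-layer BLSTM configuration are assumptions for illustration, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class RustleMaskNet(nn.Module):
    """Sketch of a bidirectional recurrent mask estimator: a BLSTM over
    log-magnitude STFT frames predicts a time-frequency mask intended to
    keep speech and suppress rustle. Illustrative configuration only."""
    def __init__(self, n_bins=513, hidden=300):
        super().__init__()
        self.blstm = nn.LSTM(n_bins, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.mask = nn.Linear(2 * hidden, n_bins)

    def forward(self, log_mag):             # (batch, frames, bins)
        h, _ = self.blstm(log_mag)
        return torch.sigmoid(self.mask(h))  # mask in [0, 1]

# Usage sketch: estimate a mask from noisy frames, apply it to the
# linear STFT magnitudes, and resynthesize with the noisy phase.
net = RustleMaskNet()
noisy = torch.randn(1, 200, 513)            # placeholder log-magnitude input
mask = net(noisy)
```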
Lookup Table Based Audio Spectral Transformation
We present a unified visual interface for flexible spectral audio manipulation based on editable lookup tables (LUTs). In the proposed approach, the audio spectrum is visualized as a two-dimensional color map of frequency versus amplitude, serving as an editable lookup table for modifying the sound. This single tool can replicate common audio effects such as equalization, pitch shifting, and spectral compression, while also enabling novel sound transformations through creative combinations of adjustments. By consolidating these capabilities into one visual platform, the system has the potential to streamline audio-editing workflows and encourage creative experimentation. The approach also supports real-time processing, providing immediate auditory feedback in an interactive graphical environment. Overall, this LUT-based method offers an accessible yet powerful framework for designing and applying a broad range of spectral audio effects through intuitive visual manipulation.
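As one interpretation of how such a lookup table can be applied (a sketch of the general idea, not the authors' implementation; the table layout, dB range, and nearest-neighbour lookup are assumptions), the snippet below treats the LUT as a per-frequency-bin mapping from input level to output level and applies it frame by frame to the STFT magnitude.

```python
import numpy as np
from scipy.signal import stft, istft

def apply_spectral_lut(x, lut, fs=44100, nfft=2048, amp_db_range=(-80.0, 0.0)):
    """Apply a (n_bins, n_levels) lookup table mapping input level (dB) to
    output level (dB) per frequency bin; n_bins must equal nfft // 2 + 1.
    Sketch of the general idea only."""
    _, _, X = stft(x, fs=fs, nperseg=nfft)
    mag, phase = np.abs(X), np.angle(X)
    db = 20.0 * np.log10(mag + 1e-10)
    lo, hi = amp_db_range
    n_bins, n_levels = lut.shape
    idx = np.clip(((db - lo) / (hi - lo) * (n_levels - 1)).astype(int),
                  0, n_levels - 1)                     # nearest level index
    out_db = lut[np.arange(n_bins)[:, None], idx]      # per-bin table lookup
    Y = 10.0 ** (out_db / 20.0) * np.exp(1j * phase)
    _, y = istft(Y, fs=fs, nperseg=nfft)
    return y

# Example LUT: identity mapping, except levels above -20 dB are compressed
# 4:1 in every bin (a crude spectral compressor).
levels = np.linspace(-80.0, 0.0, 128)
lut = np.tile(levels, (2048 // 2 + 1, 1))
lut[:, levels > -20.0] = -20.0 + 0.25 * (levels[levels > -20.0] + 20.0)
```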
Real-Time Transcription and Separation of Drum Recordings Based on NMF Decomposition
This paper proposes a real-time capable method for transcribing and separating occurrences of single drum instruments in polyphonic drum recordings. Both the detection and the decomposition are based on Non-Negative Matrix Factorization and can be implemented with very small systemic delay. We propose a simple modification to the update rules that makes it possible to capture time-dynamic spectral characteristics of the involved drum sounds. The method can be applied in music production and music education software. Performance results with respect to drum transcription are presented and discussed. The evaluation dataset, consisting of annotated drum recordings, is published for use in further studies in the field. Index Terms: drum transcription, source separation, non-negative matrix factorization, spectral processing, audio plug-in, music production, music education
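To illustrate the kind of decomposition involved, here is a generic per-frame NMF activation update against fixed drum templates, as commonly used for real-time drum transcription; the template matrix W, the iteration count, and the KL-divergence update rule are assumptions, and the paper's modification for time-dynamic spectra is not reproduced.

```python
import numpy as np

def nmf_activations(frame, W, n_iter=30, eps=1e-10):
    """Estimate per-frame NMF activations against fixed drum templates W
    (bins x components) using KL-divergence multiplicative updates.
    Generic sketch, not the modified update rules proposed in the paper."""
    h = np.full(W.shape[1], 1.0 / W.shape[1])   # activations for this frame
    v = frame + eps                              # magnitude spectrum (bins,)
    for _ in range(n_iter):
        vhat = W @ h + eps                       # current model of the frame
        h *= (W.T @ (v / vhat)) / (W.sum(axis=0) + eps)
    return h

# Usage sketch: thresholding each component's activation over time yields
# onset candidates for kick, snare, hi-hat, etc.
```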
Audio Morphing Using Matrix Decomposition and Optimal Transport
This paper presents a system for morphing between audio recordings in a continuous parameter space. The proposed approach combines matrix decompositions used for audio source separation with displacement interpolation enabled by 1D optimal transport. By interpolating the spectral components obtained using nonnegative matrix factorization of the source and target signals, the system allows varying the timbre of a sound in real time while maintaining its temporal structure. Using harmonic/percussive source separation as a pre-processing step, the system affords more detailed control of the interpolation in perceptually meaningful dimensions.
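The core ingredient, displacement interpolation of spectra via 1D optimal transport, can be sketched as follows (a minimal illustration, not the paper's implementation; the quantile count and energy handling are assumptions): the normalized spectra are treated as distributions over frequency bins, their inverse CDFs are interpolated, and the result is re-binned.

```python
import numpy as np

def displacement_interpolate(spec_a, spec_b, alpha, n_quantiles=2048):
    """1D optimal-transport (displacement) interpolation between two
    non-negative spectra: interpolate the inverse CDFs of the normalized
    spectra, then re-histogram the transported mass. Minimal sketch."""
    ea, eb = spec_a.sum(), spec_b.sum()
    cdf_a = np.cumsum(spec_a) / ea
    cdf_b = np.cumsum(spec_b) / eb
    q = np.linspace(0.0, 1.0, n_quantiles, endpoint=False) + 0.5 / n_quantiles
    bins = np.arange(len(spec_a))
    # Inverse CDFs give the "mass locations"; interpolate them linearly.
    inv_a = np.interp(q, cdf_a, bins)
    inv_b = np.interp(q, cdf_b, bins)
    loc = (1.0 - alpha) * inv_a + alpha * inv_b
    # Re-bin the transported mass onto the frequency axis.
    out, _ = np.histogram(loc, bins=len(spec_a), range=(0, len(spec_a)))
    out = out.astype(float) / n_quantiles
    return out * ((1.0 - alpha) * ea + alpha * eb)   # restore overall energy
```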