Balancing Error and Latency of Black-Box Models for Audio Effects Using Hardware-Aware Neural Architecture Search
In this paper, we address automating and systematizing the search for black-box models of virtual analogue audio effects that strike an optimal balance between error and latency. We introduce a multi-objective optimization approach based on hardware-aware neural architecture search, which allows the balance between model error and latency to be specified according to the requirements of the application. Using a regularized evolutionary algorithm, the approach navigates a huge search space systematically. Additionally, we propose a search space for modelling non-linear dynamic audio effects consisting of over 41 trillion different WaveNet-style architectures. We evaluate its performance and usefulness by finding highly effective architectures that are either up to 18× faster, or have a test loss up to 56% lower, than the best-performing models of the related work, while still showing a favourable trade-off. We conclude that hardware-aware neural architecture search is a valuable tool that can help researchers and engineers develop virtual analogue models by automating architecture design and saving the time otherwise spent on manual trial-and-error search and evaluation.
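As a rough sketch of the search loop described above, the following Python snippet scalarizes the two objectives with a user-chosen weight and runs regularized (aging) evolution over candidate architectures. All functions and values here are illustrative placeholders, not the paper's implementation; a real run would train each candidate WaveNet-style model and time it on the target hardware.

```python
import random
from collections import deque

# Hypothetical stand-ins: in practice these would train a candidate
# WaveNet-style model and measure its latency on the target hardware.
def sample_architecture():
    return {"layers": random.randint(2, 20), "channels": random.choice([8, 16, 32])}

def mutate(arch):
    child = dict(arch)
    child["layers"] = max(1, child["layers"] + random.choice([-1, 1]))
    return child

def evaluate_error(arch):
    return 1.0 / arch["layers"]                      # toy proxy for test loss

def measure_latency(arch):
    return 1e-3 * arch["layers"] * arch["channels"]  # toy proxy for latency

def fitness(arch, lam=0.5):
    # Scalarize both objectives; lam sets the error/latency balance
    # required by the application.
    return lam * evaluate_error(arch) + (1 - lam) * measure_latency(arch)

def regularized_evolution(cycles=1000, pop_size=50, sample_size=10, lam=0.5):
    population = deque()
    for _ in range(pop_size):
        arch = sample_architecture()
        population.append((arch, fitness(arch, lam)))
    for _ in range(cycles):
        tournament = random.sample(list(population), sample_size)
        parent = min(tournament, key=lambda p: p[1])  # best of a random subset
        child = mutate(parent[0])
        population.append((child, fitness(child, lam)))
        population.popleft()   # aging: remove the oldest, not the worst
    return min(population, key=lambda p: p[1])
```

The key regularization is the last line of the loop: removing the oldest member rather than the worst keeps the population from collapsing onto a single lucky evaluation.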
Identification of Nonlinear Circuits as Port-Hamiltonian Systems
This paper addresses the identification of nonlinear circuits for power-balanced virtual analog modeling and simulation. The proposed method combines a port-Hamiltonian system formulation with kernel-based methods to retrieve model laws from measurements. This combination allows the estimated model to retain physical properties that are crucial for the accuracy of simulations, while representing a variety of nonlinear behaviors. As an illustration, the method is used to identify a nonlinear passive peaking EQ.
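For context, the standard input-state-output port-Hamiltonian form that such an identification targets (textbook notation, not necessarily the paper's) makes the power balance explicit:

```latex
\dot{x} = \big(J(x) - R(x)\big)\,\nabla H(x) + G(x)\,u,
\qquad
y = G(x)^{\top}\,\nabla H(x)
```

with $J = -J^{\top}$ (lossless energy routing) and $R \succeq 0$ (dissipation), so that $\dot{H} = -\nabla H^{\top} R\,\nabla H + y^{\top} u \le y^{\top} u$: the stored energy never grows faster than the power supplied at the ports, which is the passivity property the simulation inherits by construction.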
The Sounding Gesture: An Overview
Sound control by gesture is a distinctive topic in Human-Computer Interaction: many different approaches are available, each focusing on a different perspective. Our point of view is interdisciplinary: taking into account technical considerations from control theory and sound processing, we explore the domain of expressiveness, which lies closer to psychological theories. Starting from a state of the art that outlines two main approaches to the problem of "making sound with gestures", we delve into psychological theories of expressiveness, describing in particular possible applications dealing with intermodality and mixed-reality environments related to Gestalt theory. HCI design can benefit from this kind of approach because quantitative methods can be applied to measure expressiveness. Interfaces can be used to convey expressiveness, an additional layer of information that can help interaction with the machine; this information can be coded as spatio-temporal schemes, as stated in Gestalt theory.
KRONOS - A Vectorizing Compiler for Music DSP
This paper introduces Kronos, a vectorizing Just-in-Time compiler designed for musical programming systems. Its purpose is to translate abstract mathematical expressions into high-performance computer code. Musical programming system design criteria are considered and a three-tier model of abstraction is presented. The low-level expression metalanguage used in Kronos is described, along with the design choices that facilitate powerful yet transparent vectorization of the machine code.
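The speed-up a vectorizing compiler is after can be illustrated outside Kronos itself. The toy benchmark below (plain Python/NumPy, not Kronos code) evaluates the same waveshaping expression once per sample and once per block; the block form is the kind of code a vectorizing compiler emits automatically from the abstract expression.

```python
import time
import numpy as np

x = np.random.uniform(-1.0, 1.0, 1_000_000).astype(np.float32)

def scalar_waveshaper(signal):
    out = np.empty_like(signal)
    for i, s in enumerate(signal):          # one sample at a time
        out[i] = np.tanh(1.5 * s + 0.5 * s ** 3)
    return out

def vector_waveshaper(signal):
    return np.tanh(1.5 * signal + 0.5 * signal ** 3)   # whole block at once

t0 = time.perf_counter(); scalar_waveshaper(x)
t1 = time.perf_counter(); vector_waveshaper(x)
t2 = time.perf_counter()
print(f"scalar: {t1 - t0:.2f} s, vectorized: {t2 - t1:.4f} s")
```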
Multimodal Interfaces for Expressive Sound Control
This paper introduces research issues on multimodal interaction and interfaces for expressive sound control. We introduce Multisensory Integrated Expressive Environments (MIEEs) as a framework for Mixed Reality applications in the performing arts. Paradigmatic application contexts for MIEEs are multimedia concerts; interactive dance, music, and video installations; interactive museum exhibitions; and distributed cooperative environments for theatre and artistic expression. MIEEs are user-centred systems able to interpret the high-level information conveyed by performers through their expressive gestures and to establish an effective multisensory experience, taking into account expressive, emotional, and affective content. The paper discusses the main issues for MIEEs and presents the EyesWeb (www.eyesweb.org) open software platform, which has recently been redesigned (version 4) to better address MIEE requirements. Short live demonstrations are also presented.
The Shape of RemiXXXes to Come: Audio Texture Synthesis with Time-frequency Scattering
This article explains how to apply time–frequency scattering, a convolutional operator extracting modulations in the time–frequency domain at different rates and scales, to the re-synthesis and manipulation of audio textures. After implementing phase retrieval in the scattering network by gradient backpropagation, we introduce scale-rate DAFx, a class of audio transformations expressed in the domain of time–frequency scattering coefficients. One example of scale-rate DAFx is chirp rate inversion, which causes each sonic event to be locally reversed in time while leaving the arrow of time globally unchanged. Over the past two years, our work has led to the creation of four electroacoustic pieces: FAVN; Modulator (Scattering Transform); Experimental Palimpsest; Inspection (Maida Vale Project) and Inspection II; as well as XAllegroX (Hecker Scattering.m Sequence), a remix of Lorenzo Senni’s XAllegroX, released by Warp Records on a vinyl entitled The Shape of RemiXXXes to Come.
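A minimal sketch of phase retrieval by gradient backpropagation, assuming the kymatio library's Scattering1D as a stand-in for the full time-frequency scattering operator: starting from noise, the signal is optimized until its scattering coefficients match those of the target texture.

```python
import torch
from kymatio.torch import Scattering1D  # stand-in for time-frequency scattering

T = 2 ** 13
scattering = Scattering1D(J=6, shape=T, Q=8)

target = torch.randn(T)                  # in practice: the texture to re-synthesize
with torch.no_grad():
    S_target = scattering(target)

x = torch.randn(T, requires_grad=True)   # random initialization, i.e. random phase
optimizer = torch.optim.Adam([x], lr=0.1)

for step in range(500):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(scattering(x), S_target)
    loss.backward()                      # gradients flow back through the network
    optimizer.step()
```

Scale-rate DAFx such as chirp rate inversion would then operate by editing `S_target` in the coefficient domain before running the same reconstruction.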
Local Key Estimation Based on Harmonic and Metric Structures
In this paper, we present a method for estimating the local keys of an audio signal. We propose to address the problem of local key finding by investigating the possible combination and extension of different previously proposed global key estimation approaches. The specificity of our approach is that we introduce key dependency on the harmonic and metric structures. In this work, we focus on the relationship between the chord progression and the local key progression in a piece of music. A contribution of our work is that we address the problem of finding a good analysis window length for local key estimation by introducing information related to the metric structure into our model. Key estimation is performed not on segments of empirically chosen length but on segments that are adapted to the analyzed piece and independent of the tempo. We evaluate and analyze our results on a new database composed of classical music pieces.
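As a point of reference for what segment-wise key estimation looks like, the sketch below is a generic template-matching baseline (Krumhansl-Kessler profiles correlated with averaged chroma), not the paper's model; the point of contact with the abstract is that the segment boundaries are supplied externally, e.g. from the metric structure, rather than being a fixed empirical window length.

```python
import numpy as np

# Krumhansl-Kessler major and minor key profiles (standard values).
MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                  2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
MINOR = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53,
                  2.54, 4.75, 3.98, 2.69, 3.34, 3.17])
KEYS = [np.roll(MAJOR, k) for k in range(12)] + \
       [np.roll(MINOR, k) for k in range(12)]

def local_keys(chroma, segment_bounds):
    """Estimate one key per segment.
    chroma: (12, n_frames) chromagram.
    segment_bounds: frame indices of segment edges, e.g. aligned to the
    metric structure instead of a fixed analysis window length."""
    estimates = []
    for start, end in zip(segment_bounds[:-1], segment_bounds[1:]):
        profile = chroma[:, start:end].mean(axis=1)
        scores = [np.corrcoef(profile, key)[0, 1] for key in KEYS]
        estimates.append(int(np.argmax(scores)))   # 0-11 major, 12-23 minor
    return estimates
```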
Empirical Results for Adjusting Truncated Backpropagation Through Time While Training Neural Audio Effects
This paper investigates the optimization of Truncated Backpropagation Through Time (TBPTT) for training neural networks in digital audio effect modeling, with a focus on dynamic range compression. The study evaluates key TBPTT hyperparameters – sequence number, batch size, and sequence length – and their influence on model performance. Using a convolutional-recurrent architecture, we conduct extensive experiments across datasets with and without conditioning by user controls. Results demonstrate that carefully tuning these parameters enhances model accuracy and training stability, while also reducing computational demands. Objective evaluations confirm improved performance with optimized settings, while subjective listening tests indicate that the revised TBPTT configuration maintains high perceptual quality.
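A minimal sketch of the training scheme under study, assuming a generic recurrent model (the architecture and hyperparameter values are illustrative, not the paper's): the training signal is split into truncated chunks, and the hidden state is detached at each chunk boundary so gradients flow through at most `seq_len` samples.

```python
import torch
import torch.nn as nn

model = nn.GRU(input_size=1, hidden_size=32, batch_first=True)  # stand-in network
readout = nn.Linear(32, 1)
opt = torch.optim.Adam(list(model.parameters()) + list(readout.parameters()))
loss_fn = nn.MSELoss()

# The three TBPTT hyperparameters examined in the paper (values illustrative):
batch_size, num_seqs, seq_len = 16, 8, 2048

x = torch.randn(batch_size, num_seqs * seq_len, 1)  # dry input audio
y = torch.randn(batch_size, num_seqs * seq_len, 1)  # compressed target audio

h = None
for k in range(num_seqs):                    # iterate over truncated chunks
    xk = x[:, k * seq_len:(k + 1) * seq_len]
    yk = y[:, k * seq_len:(k + 1) * seq_len]
    out, h = model(xk, h)
    loss = loss_fn(readout(out), yk)
    opt.zero_grad()
    loss.backward()                          # gradient stops at the chunk boundary
    opt.step()
    h = h.detach()                           # carry state forward without gradient
```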
Self-Supervised Disentanglement of Harmonic and Rhythmic Features in Music Audio Signals
Latent variable disentanglement aims to infer the multiple informative latent representations that lie behind a data generation process and is a key factor in controllable data generation. In this paper, we propose a deep neural network-based self-supervised learning method to infer disentangled rhythmic and harmonic representations behind music audio generation. We train a variational autoencoder that generates an audio mel-spectrogram from two latent features representing the rhythmic and harmonic content. In the training phase, the variational autoencoder is trained to reconstruct the input mel-spectrogram given its pitch-shifted version. At each forward computation in the training phase, a vector rotation operation is applied to one of the latent features, under the assumption that the dimensions of the feature vectors are related to pitch intervals. In the trained variational autoencoder, the rotated latent feature therefore represents the pitch-related information of the mel-spectrogram, and the unrotated latent feature represents the pitch-invariant information, i.e., the rhythmic content. The proposed method was evaluated using a predictor-based disentanglement metric on the learned features. Furthermore, we demonstrate its application to the automatic generation of music remixes.
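One plausible reading of the vector rotation, sketched below in PyTorch, is a circular shift of latent dimensions by the pitch-shift interval; the encoder/decoder names and the sign convention are hypothetical, chosen only to illustrate the self-supervision signal.

```python
import torch

def rotate_latent(z, semitones):
    """Circularly shift latent dimensions by a pitch interval, assuming the
    dimensions of the feature vector are aligned with pitch (semitones)."""
    return torch.roll(z, shifts=semitones, dims=-1)

# Hypothetical training step (all names are placeholders):
# z_harm, z_rhythm = encoder(pitch_shifted_mel)   # two latent branches
# z_harm = rotate_latent(z_harm, -semitones)      # undo the shift in latent space
# recon = decoder(z_harm, z_rhythm)               # reconstruct the unshifted input
# loss = reconstruction_loss(recon, original_mel) + kl_regularizers
```

Because only the rotated branch has to absorb the pitch shift, the other branch is pushed toward pitch-invariant, i.e. rhythmic, information.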
Multi-Player Microtiming Humanisation using a Multivariate Markov Model
In this paper, we present a model for the modulation of multi-performer microtiming variation in musical groups. This is done using a multivariate Markov model, in which the relationship between players is modelled using an interdependence matrix (α) and a multidimensional state transition matrix (S). This method allows us to generate more natural-sounding musical sequences by reducing the out-of-phase errors that occur in Gaussian pseudorandom and player-independent probabilistic models. We verify this using subjective listening tests, in which we demonstrate that our multivariate model outperforms commonly used univariate models at producing human-like microtiming variability. Whilst the participants in our study judged the timing sequences performed by real humans to be more natural than those of the proposed model, we still achieved a mean naturalness score of 63.39%, suggesting that the microtiming interdependence between players captured by our model significantly enhances the humanisation of group musical sequences.
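A simplified reading of one step of such a model, not the paper's exact formulation: each player's next microtiming state is drawn from a mixture of transition rows, weighted by how strongly that player attends to every other player via α.

```python
import numpy as np

def step(states, alpha, S, rng=None):
    """Draw the next microtiming state for each player.
    states: (P,) current state index per player
    alpha:  (P, P) interdependence matrix; alpha[i, j] weighs how much
            player j's state influences player i (rows sum to 1)
    S:      (P, K, K) per-player transition matrices over K microtiming states"""
    rng = rng or np.random.default_rng()
    P, K = alpha.shape[0], S.shape[-1]
    nxt = np.empty(P, dtype=int)
    for i in range(P):
        # Mix player i's transition rows, conditioned on all players' states.
        probs = sum(alpha[i, j] * S[i, states[j]] for j in range(P))
        nxt[i] = rng.choice(K, p=probs / probs.sum())
    return nxt
```

Coupling the players through α is what suppresses the out-of-phase drift that arises when each player's timing is sampled independently.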