Download A Hierarchical Deep Learning Approach for Minority Instrument Detection
Identifying instrument activities within audio excerpts is vital in music information retrieval, with significant implications for music cataloging and discovery. Prior deep learning endeavors in musical instrument recognition have predominantly emphasized instrument classes with ample data availability. Recent studies have demonstrated the applicability of hierarchical classification in detecting instrument activities in orchestral music, even with limited fine-grained annotations at the instrument level. Based on the Hornbostel-Sachs classification, such a hierarchical classification system is evaluated using the MedleyDB dataset, renowned for its diversity and richness concerning various instruments and music genres. This work presents various strategies to integrate hierarchical structures into models and tests a new class of models for hierarchical music prediction. This study showcases more reliable coarse-level instrument detection by bridging the gap between detailed instrument identification and group-level recognition, paving the way for further advancements in this domain.
Download Equalizing Loudspeakers in Reverberant Environments Using Deep Convolutive Dereverberation
Loudspeaker equalization is an established topic in the literature, and currently many techniques are available to address most practical use cases. However, most of these rely on accurate measurements of the loudspeaker in an anechoic environment, which in some occurrences is not feasible. This is the case, e.g. of custom digital organs, which have a set of loudspeakers that are built into a large and geometrically-complex piece of furniture, which may be too heavy and large to be transported to a measurement room, or may require a big one, making traditional impulse response measurements impractical for most users. In this work we propose a method to find the inverse of the sound emission system in a reverberant environment, based on a Deep Learning dereverberation algorithm. The method is agnostic of the room characteristics and can be, thus, conducted in an automated fashion in any environment. A real use case is discussed and results are provided, showing the effectiveness of the approach in designing filters that match closely the magnitude response of the ideal inverting filters.
Download A Deep Learning Approach to the Prediction of Time-Frequency Spatial Parameters for Use in Stereo Upmixing
This paper presents a deep learning approach to parametric timefrequency parameter prediction for use within stereo upmixing algorithms. The approach presented uses a Multi-Channel U-Net with Residual connections (MuCh-Res-U-Net) trained on a novel dataset of stereo and parametric time-frequency spatial audio data to predict time-frequency spatial parameters from a stereo input signal for positions on a 50-point Lebedev quadrature sampled sphere. An example upmix pipeline is then proposed which utilises the predicted time-frequency spatial parameters to both extract and remap stereo signal components to target spherical harmonic components to facilitate the generation of a full spherical representation of the upmixed sound field.
Download Towards Efficient Modelling of String Dynamics: A Comparison of State Space and Koopman Based Deep Learning Methods
This paper presents an examination of State Space Models (SSM) and Koopman-based deep learning methods for modelling the dynamics of both linear and non-linear stiff strings. Through experiments with datasets generated under different initial conditions and sample rates, we assess the capacity of these models to accurately model the complex behaviours observed in string dynamics. Our findings indicate that our proposed Koopman-based model performs as well as or better than other existing approaches in nonlinear cases for long-sequence modelling. We inform the design of these architectures with the structure of the problems at hand. Although challenges remain in extending model predictions beyond the training horizon (i.e., extrapolation), the focus of our investigation lies in the models’ ability to generalise across different initial conditions within the training time interval. This research contributes insights into the physical modelling of dynamical systems (in particular those addressing musical acoustics) by offering a comparative overview of these and previous methods and introducing innovative strategies for model improvement. Our results highlight the efficacy of these models in simulating non-linear dynamics and emphasise their wide-ranging applicability in accurately modelling dynamical systems over extended sequences.
Download Neural Audio Processing on Android Phones
This study investigates the potential of real-time inference of neural audio effects on Android smartphones, marking an initial step towards bridging the gap in neural audio processing for mobile devices. Focusing exclusively on processing rather than synthesis, we explore the performance of three open-source neural models across five Android phones released between 2014 and 2022, showcasing varied capabilities due to their generational differences. Through comparative analysis utilizing two C++ inference engines (ONNX Runtime and RTNeural), we aim to evaluate the computational efficiency and timing performance of these models, considering the varying computational loads and the hardware specifics of each device. Our work contributes insights into the feasibility of implementing neural audio processing in real-time on mobile platforms, highlighting challenges and opportunities for future advancements in this rapidly evolving field.
Download Distortion Recovery: A Two-Stage Method for Guitar Effect Removal
Removing audio effects from electric guitar recordings makes it easier for post-production and sound editing. An audio distortion recovery model not only improves the clarity of the guitar sounds but also opens up new opportunities for creative adjustments in mixing and mastering. While progress have been made in creating such models, previous efforts have largely focused on synthetic distortions that may be too simplistic to accurately capture the complexities seen in real-world recordings. In this paper, we tackle the task by using a dataset of guitar recordings rendered with commercial-grade audio effect VST plugins. Moreover, we introduce a novel two-stage methodology for audio distortion recovery. The idea is to firstly process the audio signal in the Mel-spectrogram domain in the first stage, and then use a neural vocoder to generate the pristine original guitar sound from the processed Mel-spectrogram in the second stage. We report a set of experiments demonstrating the effectiveness of our approach over existing methods, through both subjective and objective evaluation metrics.
Download Automatic Equalization for Individual Instrument Tracks Using Convolutional Neural Networks
We propose a novel approach for the automatic equalization of individual musical instrument tracks. Our method begins by identifying the instrument present within a source recording in order to choose its corresponding ideal spectrum as a target. Next, the spectral difference between the recording and the target is calculated, and accordingly, an equalizer matching model is used to predict settings for a parametric equalizer. To this end, we build upon a differentiable parametric equalizer matching neural network, demonstrating improvements relative to previously established state-of-the-art. Unlike past approaches, we show how our system naturally allows real-world audio data to be leveraged during the training of our matching model, effectively generating suitably produced training targets in an automated manner mirroring conditions at inference time. Consequently, we illustrate how fine-tuning our matching model on such examples considerably improves parametric equalizer matching performance in realworld scenarios, decreasing mean absolute error by 24% relative to methods relying solely on random parameter sampling techniques as a self-supervised learning strategy. We perform listening tests, and demonstrate that our proposed automatic equalization solution subjectively enhances the tonal characteristics for recordings of common instrument types.
Download Guitar Tone Stack Modeling with a Neural State-Space Filter
In this work, we present a data-driven approach to modeling tone stack circuits in guitar amplifiers and distortion pedals. To this aim, the proposed modeling approach uses a feedforward fully connected neural network to predict the parameters of a coupledform state-space filter, ensuring the numerical stability of the resulting time-varying system. The neural network is conditioned on the tone controls of the target tone stack and is optimized jointly with the coupled-form state-space filter to match the target frequency response. To assess the proposed approach, we model three popular tone stack schematics with both matched-order and overparameterized filters and conduct an objective comparison with well-established approaches that use cascaded biquad filters. Results from the conducted experiments demonstrate improved accuracy of the proposed modeling approach, especially in the case of over-parameterized state-space filters while guaranteeing numerical stability. Our method can be deployed, after training, in realtime audio processors.
Download Balancing Error and Latency of Black-Box Models for Audio Effects Using Hardware-Aware Neural Architecture Search
In this paper, we address automating and systematizing the process of finding black-box models for virtual analogue audio effects with an optimal balance between error and latency. We introduce a multi-objective optimization approach based on hardware-aware neural architecture search which allows specifying the optimization balance of model error and latency according to the requirements of the application. By using a regularized evolutionary algorithm, it is able to navigate through a huge search space systematically. Additionally, we propose a search space for modelling non-linear dynamic audio effects consisting of over 41 trillion different WaveNet-style architectures. We evaluate its performance and usefulness by yielding highly effective architectures, either up to 18× faster or with a test loss of up to 56% less than the best performing models of the related work, while still showing a favourable trade-off. We can conclude that hardware-aware neural architecture search is a valuable tool that can help researchers and engineers developing virtual analogue models by automating the architecture design and saving time by avoiding manual search and evaluation through trial-and-error.
Download Hyper Recurrent Neural Network: Condition Mechanisms for Black-Box Audio Effect Modeling
Recurrent neural networks (RNNs) have demonstrated impressive results for virtual analog modeling of audio effects. These networks process time-domain audio signals using a series of matrix multiplication and nonlinear activation functions to emulate the behavior of the target device accurately. To additionally model the effect of the knobs for an RNN-based model, existing approaches integrate control parameters by concatenating them channel-wisely with some intermediate representation of the input signal. While this method is parameter-efficient, there is room to further improve the quality of generated audio because the concatenation-based conditioning method has limited capacity in modulating signals. In this paper, we propose three novel conditioning mechanisms for RNNs, tailored for black-box virtual analog modeling. These advanced conditioning mechanisms modulate the model based on control parameters, yielding superior results to existing RNN- and CNN-based architectures across various evaluation metrics.