Reservoir Computing: A Powerful Framework for Nonlinear Audio Processing
This paper proposes reservoir computing as a general framework for nonlinear audio processing. Reservoir computing is a novel approach to recurrent neural network training with the advantage of a very simple, linear learning algorithm. It can in theory approximate arbitrary nonlinear dynamical systems with arbitrary precision, has an inherent temporal processing capability, and is therefore well suited to many nonlinear audio processing problems. Whenever the data contain nonlinear relationships and temporal information is crucial, reservoir computing can be applied. Examples from three application areas are presented: nonlinear system identification of a tube amplifier emulator algorithm; nonlinear audio prediction, as required in wireless audio transmission where dropouts may occur; and automatic melody transcription from a polyphonic audio stream, as one example from the broad field of music information retrieval. Reservoir computing outperformed state-of-the-art alternative models in all of the studied tasks.
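To make the "simple and linear learning algorithm" concrete, below is a minimal echo state network sketch: a fixed random reservoir is driven by the input and only a linear readout is trained, here by ridge regression. The reservoir size, scaling, and toy tube-style target nonlinearity are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def train_esn(u, y, n_reservoir=200, spectral_radius=0.9, ridge=1e-6, seed=0):
    """Minimal echo state network: fixed random reservoir, trained linear readout."""
    rng = np.random.default_rng(seed)
    W_in = rng.uniform(-0.5, 0.5, (n_reservoir, u.shape[1]))
    W = rng.uniform(-0.5, 0.5, (n_reservoir, n_reservoir))
    W *= spectral_radius / max(abs(np.linalg.eigvals(W)))   # scale to desired spectral radius

    # Run the reservoir over the input sequence and collect its states.
    x = np.zeros(n_reservoir)
    states = np.empty((len(u), n_reservoir))
    for t in range(len(u)):
        x = np.tanh(W_in @ u[t] + W @ x)
        states[t] = x

    # Linear readout via ridge regression -- the only trained part of the model.
    A = states.T @ states + ridge * np.eye(n_reservoir)
    W_out = np.linalg.solve(A, states.T @ y)
    return W_in, W, W_out

# Illustrative use: identify a toy static nonlinearity from input/output signals.
u = np.random.randn(1000, 1)          # input signal (e.g. clean audio)
y = np.tanh(2.0 * u)                  # target (e.g. tube-amp style distortion)
W_in, W, W_out = train_esn(u, y)
```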
Development of a Quality Assurance Automatic Listening Machine (QuAALM)
This paper describes the development and application of a machine listening system for the automated testing of implementation equivalence in music signal processing effects that contain a high level of randomized time variation. We describe a mathematical model of generalized randomization in audio effects and explore different representations of the effect’s data. We then propose a set of classifiers to reliably determine whether two implementations of the same randomized audio effect are functionally equivalent. After testing these classifiers against each other and against a set of human listeners, we select the best-performing classifier and determine that it agrees with the judgment of human listeners with an F1-score of 0.8696.
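As a rough illustration of how such a classifier can be scored against listeners, the sketch below computes an F1-score between hypothetical per-pair equivalence decisions and human judgments; the distances, threshold, and labels are invented placeholders, not the paper's data or features.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical per-pair decisions: 1 = "the two implementations are equivalent", 0 = "they differ".
human_judgements = np.array([1, 1, 0, 1, 0, 1, 1, 0, 0, 1])

# Placeholder classifier: declare equivalence when some distance between the two
# implementations' outputs (e.g. averaged over many randomized renders) is small.
pair_distances = np.array([0.02, 0.05, 0.40, 0.03, 0.35, 0.04, 0.06, 0.50, 0.12, 0.01])
classifier_decisions = (pair_distances < 0.10).astype(int)

print("F1 agreement with listeners:", f1_score(human_judgements, classifier_decisions))
```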
Generating Musical Accompaniment Using Finite State Transducers
The finite state transducer (FST), a type of finite state machine that maps an input string to an output string, is a common tool in the fields of natural language processing and speech recognition. FSTs have also been applied to music-related tasks such as audio fingerprinting and the generation of musical accompaniment. In this paper, we describe a system that uses an FST to generate harmonic accompaniment to a melody. We provide details of the methods employed to quantize a music signal and of the topology of the transducer, and discuss our approach to evaluating the system. We argue for an evaluation metric that takes into account the quality of the generated accompaniment, rather than one that returns a binary value indicating whether the accompaniment is correct or incorrect.
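The toy sketch below illustrates the general idea of a weighted transducer that consumes a quantized melody and emits chord labels via a best-path search; the states, symbols, and weights are invented for illustration and do not reflect the paper's transducer topology.

```python
# Toy weighted FST: states are harmonic contexts, arcs consume a melody pitch
# class and emit a chord label.  Weights are costs; lower is better.
arcs = {
    # (state, input symbol) -> list of (next state, output symbol, cost)
    ("start", "C"): [("I", "Cmaj", 0.1), ("vi", "Amin", 0.7)],
    ("I", "E"):     [("I", "Cmaj", 0.2), ("IV", "Fmaj", 0.9)],
    ("I", "F"):     [("IV", "Fmaj", 0.2)],
    ("IV", "G"):    [("V", "Gmaj", 0.1)],
    ("V", "C"):     [("I", "Cmaj", 0.1)],
}

def best_accompaniment(melody):
    """Best-path (lowest total cost) search through the transducer for a melody."""
    beams = {"start": (0.0, [])}          # state -> (total cost, emitted chords)
    for note in melody:
        next_beams = {}
        for state, (cost, chords) in beams.items():
            for nxt, chord, w in arcs.get((state, note), []):
                cand = (cost + w, chords + [chord])
                if nxt not in next_beams or cand[0] < next_beams[nxt][0]:
                    next_beams[nxt] = cand
        beams = next_beams
        if not beams:
            return None                   # melody not accepted by this toy transducer
    return min(beams.values())[1]

print(best_accompaniment(["C", "E", "F", "G", "C"]))
# -> ['Cmaj', 'Cmaj', 'Fmaj', 'Gmaj', 'Cmaj']
```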
Data Augmentation for Instrument Classification Robust to Audio Effects
Reusing recorded sounds (sampling) is a key component of Electronic Music Production (EMP); it has been present since EMP's early days and is at the core of genres like hip-hop and jungle. Commercial and non-commercial services allow users to obtain collections of sounds (sample packs) to reuse in their compositions. Automatic classification of one-shot instrumental sounds makes it possible to categorise the sounds contained in these collections automatically, enabling easier navigation and better characterisation. Automatic instrument classification has mostly targeted the classification of unprocessed isolated instrumental sounds or the detection of predominant instruments in mixed music tracks. For this classification to be useful in audio databases for EMP, it has to be robust to the audio effects applied to unprocessed sounds. In this paper we evaluate how a state-of-the-art model trained with a large dataset of one-shot instrumental sounds performs when classifying instruments processed with audio effects. In order to evaluate the robustness of the model, we use data augmentation with audio effects and evaluate how each effect influences the classification accuracy.
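A minimal sketch of this kind of augmentation is shown below: each one-shot sample is processed with a few simple effects (gain, soft-clipping "distortion", a short feedback delay) implemented directly in NumPy. The specific effects and parameter ranges are assumptions for illustration and are not the ones evaluated in the paper.

```python
import numpy as np

SR = 44100  # assumed sample rate

def gain(x, db):
    return x * 10.0 ** (db / 20.0)

def soft_clip(x, drive=4.0):
    """Crude distortion: drive the signal into tanh, then renormalize."""
    y = np.tanh(drive * x)
    return y / (np.max(np.abs(y)) + 1e-9)

def feedback_delay(x, delay_s=0.08, feedback=0.4, n_taps=4):
    """Small echo effect built by summing attenuated, shifted copies of the signal."""
    d = int(delay_s * SR)
    y = np.copy(x)
    for k in range(1, n_taps + 1):
        y[k * d:] += (feedback ** k) * x[: len(x) - k * d]
    return y

def augment(sample):
    """Yield effect-processed variants of a one-shot sample for training."""
    yield gain(sample, db=np.random.uniform(-12, 6))
    yield soft_clip(sample, drive=np.random.uniform(2, 8))
    yield feedback_delay(sample, delay_s=np.random.uniform(0.02, 0.12))

one_shot = np.random.randn(SR)          # placeholder for a loaded one-shot sound
augmented = list(augment(one_shot))     # processed versions to add to the evaluation set
```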
An active learning procedure for the interaural time difference discrimination threshold
Measuring the auditory lateralization elicited by interaural time difference (ITD) cues involves the estimation of a psychometric function (PF). The shape of this function usually follows from the analysis of the subjective data and models the probability of correctly localizing the angular position of a sound source. The present study describes and evaluates a procedure for progressively fitting a PF, using Gaussian process classification of the subjective responses produced during a binary decision experiment. The process adaptively refines an approximate PF, following Bayesian inference. At each trial, it suggests the most informative auditory stimulus for function refinement according to Bayesian active learning by disagreement (BALD) mutual information. In this paper, the procedure was modified to accommodate two-alternative forced choice (2AFC) experimental methods and was then compared with a standard adaptive “three-down, one-up” staircase procedure. Our process approximates the average threshold ITD at the 79.4%-correct level of lateralization with a mean accuracy increase of 8.9% over the Weibull function fitted to the data of the same test. The final accuracy for the Just Noticeable Difference (JND) in ITD is achieved with only 37.6% of the trials needed by a standard lateralization test.
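For reference, the sketch below implements the standard "three-down, one-up" staircase used here as the baseline, which converges near the 79.4%-correct point; the simulated listener, step sizes, and reversal counts are placeholder assumptions.

```python
import numpy as np

def staircase_3down_1up(respond, start_itd_us=400.0, step_us=50.0,
                        min_step_us=12.5, n_reversals=12):
    """Adaptive 'three-down, one-up' staircase; tracks roughly the 79.4%-correct point."""
    itd, correct_run, direction = start_itd_us, 0, 0
    reversals = []
    while len(reversals) < n_reversals:
        if respond(itd):                     # correct 2AFC response
            correct_run += 1
            if correct_run == 3:             # three correct in a row -> make it harder
                correct_run = 0
                new_dir = -1
                itd = max(itd - step_us, 1.0)
            else:
                new_dir = direction
        else:                                # one error -> make it easier
            correct_run = 0
            new_dir = +1
            itd += step_us
        if direction and new_dir != direction:
            reversals.append(itd)
            step_us = max(step_us / 2.0, min_step_us)   # shrink step at each reversal
        direction = new_dir
    return np.mean(reversals[-8:])           # JND estimate from late reversals

# Simulated listener with a hypothetical ~100 us ITD threshold (logistic psychometric function).
rng = np.random.default_rng(0)
listener = lambda itd: rng.random() < 0.5 + 0.5 / (1.0 + np.exp(-(itd - 100.0) / 30.0))
print("Estimated ITD JND (us):", staircase_3down_1up(listener))
```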
Generative Latent Spaces for Neural Synthesis of Audio Textures
This paper investigates the synthesis of audio textures and the structure of generative latent spaces using Variational Autoencoders (VAEs) within two paradigms of neural audio synthesis: DSP-inspired and data-driven approaches. For each paradigm, we propose VAE-based frameworks that allow fine-grained temporal control. We introduce datasets across three categories of environmental sounds to support our investigations. We evaluate and compare the models’ reconstruction performance using objective metrics, and investigate their generative capabilities and latent space structure through latent space interpolations.
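A minimal VAE sketch over single spectrogram frames is shown below, illustrating the reparameterization trick and the kind of latent interpolation used to probe latent space structure; the layer sizes and frame representation are placeholders rather than either of the proposed frameworks.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameVAE(nn.Module):
    """Toy VAE over single spectrogram frames (e.g. 513 magnitude bins)."""
    def __init__(self, n_bins=513, latent_dim=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_bins, 256), nn.ReLU())
        self.mu = nn.Linear(256, latent_dim)
        self.logvar = nn.Linear(256, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                 nn.Linear(256, n_bins), nn.Softplus())

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        return self.dec(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar, beta=1.0):
    recon = F.mse_loss(x_hat, x, reduction="mean")
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kld

# Latent interpolation between two frames: decode points on the line between z1 and z2.
model = FrameVAE()
x1, x2 = torch.rand(1, 513), torch.rand(1, 513)
with torch.no_grad():
    z1, z2 = model.mu(model.enc(x1)), model.mu(model.enc(x2))
    frames = [model.dec((1 - a) * z1 + a * z2) for a in torch.linspace(0, 1, 8)]
```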
A Study of Control Methods for Percussive Sound Synthesis Based on GANs
The process of creating drum sounds has seen significant evolution over the past decades. The development of analogue drum synthesizers, such as the TR-808, and of modern sound design tools in Digital Audio Workstations led to a variety of drum timbres that defined entire musical genres. Recently, drum synthesis research has been revived with a new focus on training generative neural networks to create drum sounds. Different interfaces have previously been proposed to control the generative process, from low-level latent space navigation to high-level semantic feature parameterisation, but no comprehensive analysis has been presented to evaluate how each approach relates to the creative process. We aim to evaluate how different interfaces support creative control over drum generation by conducting a user study based on the Creative Support Index. We experiment with both a supervised method that decodes semantic latent space directions and an unsupervised Closed-Form Factorization approach from the computer vision literature to parameterise the generation process, and demonstrate that the latter is the preferred means of controlling a drum synthesizer based on the StyleGAN2 network architecture.
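The closed-form factorization idea can be sketched in a few lines: the top eigenvectors of W^T W, where W is the first layer that projects the latent code, are taken as editing directions. The weight below is a random placeholder standing in for a trained StyleGAN2 layer, so this is only an illustration of the technique, not the paper's code.

```python
import numpy as np

def closed_form_directions(W, k=5):
    """Closed-form factorization (SeFa-style): top eigenvectors of W^T W as editing directions."""
    eigvals, eigvecs = np.linalg.eigh(W.T @ W)       # W^T W is symmetric -> eigh
    order = np.argsort(eigvals)[::-1]                # strongest variation first
    return eigvecs[:, order[:k]].T                   # k directions, shape (k, latent_dim)

# Placeholder for the weight of the generator's first latent-projection layer.
latent_dim, out_dim = 512, 2048
W = np.random.randn(out_dim, latent_dim)
directions = closed_form_directions(W, k=5)

# Editing a latent code: move along a discovered direction with strength alpha,
# then feed the edited code to the generator.
z = np.random.randn(latent_dim)
z_edited = z + 3.0 * directions[0]
```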
A Generative Model for Raw Audio Using Transformer Architectures
This paper proposes a novel way of performing audio synthesis at the waveform level using Transformer architectures. We propose a deep neural network for generating waveforms, similar to WaveNet. It is fully probabilistic, auto-regressive, and causal, i.e., each generated sample depends only on previously observed samples. Our approach outperforms a widely used WaveNet architecture by up to 9% on a similar dataset for next-step prediction. Using the attention mechanism, we enable the architecture to learn which audio samples are important for the prediction of the future sample. We show how causal Transformer generative models can be used for raw waveform synthesis. We also show that this performance can be improved by another 2% by conditioning samples over a wider context. The flexibility of the current model to synthesize audio from latent representations suggests a large number of potential applications. However, the novel approach of using generative Transformer architectures for raw audio synthesis is, like WaveNet, still far from generating meaningful music without latent codes or metadata to aid the generation process.
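A minimal causal Transformer of this kind is sketched below: it autoregressively predicts the next mu-law quantized sample (256 classes, as in WaveNet-style models) under a causal attention mask. Layer sizes and the short context are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class CausalAudioTransformer(nn.Module):
    """Toy next-sample predictor over mu-law quantized audio (256 classes)."""
    def __init__(self, n_classes=256, d_model=128, n_heads=4, n_layers=4, context=1024):
        super().__init__()
        self.embed = nn.Embedding(n_classes, d_model)
        self.pos = nn.Parameter(torch.zeros(context, d_model))   # learned positional encoding
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=256,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x):                        # x: (batch, time) integer sample indices
        T = x.shape[1]
        h = self.embed(x) + self.pos[:T]
        # Causal mask: each position may only attend to itself and earlier samples.
        mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.encoder(h, mask=mask)
        return self.head(h)                      # logits for the next sample at each step

model = CausalAudioTransformer()
x = torch.randint(0, 256, (2, 512))              # two short mu-law quantized excerpts
logits = model(x)                                # (2, 512, 256)
loss = nn.functional.cross_entropy(logits[:, :-1].reshape(-1, 256), x[:, 1:].reshape(-1))
```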
Unsupervised Text-to-Sound Mapping via Embedding Space Alignment
This work focuses on developing an artistic tool that performs an unsupervised mapping between text and sound, converting an input text string into a series of sounds from a given sound corpus. With the use of a pre-trained sound embedding model and a separate, pre-trained text embedding model, the goal is to find a mapping between the two feature spaces. Our approach is unsupervised, which allows any sound corpus to be used with the system. The tool performs the task of text-to-sound retrieval, creating a sound file in which each word in the text input is mapped to a single sound in the corpus, and the resulting sounds are concatenated to play sequentially. We experiment with three different mapping methods, and perform quantitative and qualitative evaluations of the outputs. Our results demonstrate the potential of unsupervised methods for creative applications in text-to-sound mapping.
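One simple mapping of this kind is sketched below: both embedding spaces are standardized and each word embedding is matched to its cosine nearest neighbour in the sound corpus. The embedding matrices are random placeholders, and this is only one of several possible mapping methods, not necessarily those studied in the paper.

```python
import numpy as np

def standardize(E):
    """Zero-mean, unit-variance per dimension so the two spaces are roughly comparable."""
    return (E - E.mean(axis=0)) / (E.std(axis=0) + 1e-9)

def nearest_sounds(word_emb, sound_emb):
    """Cosine nearest neighbour from each word embedding to a sound embedding."""
    W = standardize(word_emb)
    S = standardize(sound_emb)
    W = W / np.linalg.norm(W, axis=1, keepdims=True)
    S = S / np.linalg.norm(S, axis=1, keepdims=True)
    return np.argmax(W @ S.T, axis=1)            # index of the chosen sound for each word

# Placeholder embeddings: 6 input words and a corpus of 500 sounds.  In practice
# these would come from pre-trained text and sound embedding models, possibly of
# different dimensionality, requiring a projection to a common size first.
word_emb = np.random.randn(6, 128)
sound_emb = np.random.randn(500, 128)
indices = nearest_sounds(word_emb, sound_emb)     # concatenate these sounds in order
```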
Increasing Drum Transcription Vocabulary Using Data Synthesis
Current datasets for automatic drum transcription (ADT) are small and limited due to the tedious task of annotating onset events. While some of these datasets contain large vocabularies of percussive instrument classes (e.g. ~20 classes), many of these classes occur very infrequently in the data. This paucity of data makes it difficult to train models that support such large vocabularies. Therefore, data-driven drum transcription models often focus on a small number of percussive instrument classes (e.g. 3 classes). In this paper, we propose to support large-vocabulary drum transcription by generating a large synthetic dataset (210,000 eight-second examples) of audio examples for which we have ground-truth transcriptions. Using this synthetic dataset along with existing drum transcription datasets, we train convolutional-recurrent neural networks (CRNNs) in a multi-task framework to support large-vocabulary ADT. We find that training on both the synthetic and real music drum transcription datasets together improves performance not only on large-vocabulary ADT, but also on beat/downbeat detection and small-vocabulary ADT.
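A minimal CRNN of the kind described, with an auxiliary beat/downbeat head for the multi-task setting, is sketched below; the class count, mel resolution, and layer sizes are placeholder assumptions rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class DrumCRNN(nn.Module):
    """Toy CRNN for large-vocabulary ADT with an auxiliary beat/downbeat head."""
    def __init__(self, n_mels=80, n_drum_classes=20):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 1)),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 1)),
        )
        self.gru = nn.GRU(32 * (n_mels // 4), 128, batch_first=True, bidirectional=True)
        self.drum_head = nn.Linear(256, n_drum_classes)   # per-frame drum onset activations
        self.beat_head = nn.Linear(256, 2)                # beat / downbeat activations

    def forward(self, mel):                      # mel: (batch, 1, n_mels, time)
        h = self.conv(mel)                        # (batch, 32, n_mels//4, time)
        b, c, f, t = h.shape
        h = h.permute(0, 3, 1, 2).reshape(b, t, c * f)
        h, _ = self.gru(h)
        return torch.sigmoid(self.drum_head(h)), torch.sigmoid(self.beat_head(h))

model = DrumCRNN()
mel = torch.rand(4, 1, 80, 400)                   # four 400-frame mel spectrogram excerpts
drum_act, beat_act = model(mel)                   # (4, 400, 20) and (4, 400, 2)
```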