Towards Morphological Sound Description using segmental models
We present an approach to modeling the temporal evolution of audio descriptors using Segmental Models (SMs). The method yields a segmentation of the signal into a sequence of primitives, constituted by a set of user-defined trajectories. This makes it possible to consider specific primitive shapes, to model their duration, and to take into account the time dependence between successive signal frames, contrary to standard Hidden Markov Models. We applied this approach to a database of violin playing in which various types of glissando and dynamics variations were specifically recorded. The results show that our approach using Segmental Models provides a segmentation that can be easily interpreted. Quantitatively, the Segmental Models performed better than a standard implementation of Hidden Markov Models.
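As an illustration of the segmental idea only (not the paper's implementation), the sketch below segments a one-dimensional descriptor curve into user-defined primitive shapes, here an assumed flat plateau and a linear ramp, by dynamic programming over boundary positions; a fixed per-segment penalty stands in for a duration model.

```python
# Illustrative sketch: segmenting a descriptor curve into shape primitives.
import numpy as np

def primitive_cost(frames, shape):
    """Squared error of fitting one candidate segment to a primitive shape."""
    t = np.linspace(0.0, 1.0, len(frames))
    if shape == "flat":                       # constant trajectory
        fit = np.full_like(frames, frames.mean())
    else:                                     # "ramp": least-squares straight line
        slope, intercept = np.polyfit(t, frames, 1)
        fit = slope * t + intercept
    return float(np.sum((frames - fit) ** 2))

def segment(descriptor, shapes=("flat", "ramp"), min_len=5, penalty=1.0):
    """Boundaries and shapes minimizing total fit error plus a per-segment penalty."""
    n = len(descriptor)
    best = np.full(n + 1, np.inf)
    best[0] = 0.0
    back = [None] * (n + 1)
    for end in range(min_len, n + 1):
        for start in range(0, end - min_len + 1):
            if not np.isfinite(best[start]):
                continue
            for shape in shapes:
                cost = best[start] + primitive_cost(descriptor[start:end], shape) + penalty
                if cost < best[end]:
                    best[end], back[end] = cost, (start, shape)
    segments, i = [], n
    while i > 0:                              # backtrack the optimal segmentation
        start, shape = back[i]
        segments.append((start, i, shape))
        i = start
    return segments[::-1]

# Toy descriptor curve: a linear ramp followed by a plateau.
curve = np.concatenate([np.linspace(0.0, 1.0, 40), np.ones(40)])
print(segment(curve))
```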
A Matlab Toolbox for Musical Feature Extraction from Audio
We present MIRtoolbox, an integrated set of functions written in Matlab, dedicated to the extraction of musical features from audio files. The design is based on a modular framework: the different algorithms are decomposed into stages, formalized using a minimal set of elementary mechanisms, and integrating different variants proposed by alternative approaches, including new strategies we have developed, which users can select and parametrize. This paper offers an overview of the set of features that can be extracted with MIRtoolbox, related, among others, to timbre, tonality, rhythm and form. Four particular analyses are provided as examples. The toolbox also includes functions for statistical analysis, segmentation and clustering. Particular attention has been paid to the design of a syntax that offers both simplicity of use and transparent adaptiveness to a multiplicity of possible input types. Each feature extraction method can accept as argument an audio file, or any preliminary result from intermediary stages of the chain of operations. The same syntax can also be used for analyses of single audio files, batches of files, series of audio segments, multichannel signals, etc. For that purpose, the data and methods of the toolbox are organised in an object-oriented architecture.
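MIRtoolbox itself is written in Matlab; purely as an illustration of the design point about flexible inputs, the Python sketch below shows how a feature-extraction stage can accept either a file name or the intermediate result of an earlier stage, so the same call works anywhere in the chain. All names here are hypothetical and not the toolbox's API.

```python
# Illustration of the "same syntax for file or intermediate result" idea.
import numpy as np

class Audio:
    def __init__(self, source):
        if isinstance(source, str):           # file name: placeholder loader
            self.signal = np.zeros(44100)
        else:                                 # intermediate result: reuse it
            self.signal = np.asarray(source, dtype=float)

def rms(source):
    """Feature stage: accepts a file name, raw samples, or an Audio object."""
    audio = source if isinstance(source, Audio) else Audio(source)
    return float(np.sqrt(np.mean(audio.signal ** 2)))

print(rms("song.wav"))                                 # from a (placeholder) file name
print(rms(np.sin(np.linspace(0.0, 6.28, 1000))))       # from an intermediate result
```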
Subjective Evaluation of Sound Quality and Control of Drum Synthesis with Stylewavegan
In this paper we investigate the perceptual properties of StyleWaveGAN, a drum synthesis method proposed in a previous publication. StyleWaveGAN has been shown to deliver state-of-the-art performance on quantitative metrics (FAD and MSE of the control parameters) for both sound quality and control precision. The present paper aims to provide insight into the perceptual relevance of these results. Accordingly, we performed a subjective evaluation of the sound quality as well as a subjective evaluation of the precision of the control, using timbre descriptors from the AudioCommons toolbox. We evaluate the sound quality with mean opinion scores and measure the psychophysical response to variations of the control. By means of these perceptual tests, we demonstrate that StyleWaveGAN produces better sound quality than the state-of-the-art model DrumGAN and that the mean control error is lower than the absolute threshold of perception at every measurement point used in the experiment.
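For reference, a minimal sketch of how a mean opinion score and its confidence interval might be computed from listener ratings; the ratings below are made up and this is not the paper's analysis code.

```python
# Mean opinion score (MOS) with a normal-approximation 95% confidence interval.
import numpy as np

def mos(ratings):
    r = np.asarray(ratings, dtype=float)
    mean = r.mean()
    half = 1.96 * r.std(ddof=1) / np.sqrt(len(r))   # half-width of the interval
    return mean, half

scores = [4, 5, 3, 4, 4, 5, 4, 3, 5, 4]             # hypothetical ratings on a 1-5 scale
m, ci = mos(scores)
print(f"MOS = {m:.2f} +/- {ci:.2f}")
```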
Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics
Timbre spaces have been used in music perception to study the perceptual relationships between instruments based on dissimilarity ratings. However, these spaces do not generalize to novel examples and do not provide an invertible mapping, preventing audio synthesis. In parallel, generative models have aimed to provide methods for synthesizing novel timbres. However, these systems do not provide an understanding of their inner workings and are usually not related to any perceptually relevant information. Here, we show that Variational Auto-Encoders (VAEs) can alleviate all of these limitations by constructing generative timbre spaces. To do so, we adapt VAEs to learn an audio latent space, while using perceptual ratings from timbre studies to regularize the organization of this space. The resulting space allows us to analyze novel instruments, while being able to synthesize audio from any point of this space. We introduce a specific regularization that enforces any given set of similarity distances onto these spaces. We show that the resulting spaces provide distance relationships close to those of the original timbre spaces. We evaluate several spectral transforms and show that the Non-Stationary Gabor Transform (NSGT) provides the highest correlation to timbre spaces and the best quality of synthesis. Furthermore, we show that these spaces can generalize to novel instruments and can generate any path between instruments to understand their timbre relationships. As these spaces are continuous, we study how audio descriptors behave along the latent dimensions. We show that even though descriptors have an overall non-linear topology, they follow a locally smooth evolution. Based on this, we introduce a method for descriptor-based synthesis and show that we can control the descriptors of an instrument while keeping its timbre structure.
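A hedged sketch of the kind of regularization described: alongside the usual reconstruction and KL terms, an extra penalty matches pairwise latent distances to perceptual dissimilarity ratings. The function names, weights, and exact loss form are assumptions for illustration, not the paper's implementation.

```python
# Perceptually regularized VAE objective (illustrative form only).
import torch

def perceptual_regularizer(z, dissimilarity):
    """z: (n, d) latent means for n instruments; dissimilarity: (n, n) ratings."""
    latent_dist = torch.cdist(z, z, p=2)            # pairwise latent distances
    return torch.mean((latent_dist - dissimilarity) ** 2)

def total_loss(recon_loss, kl_loss, z, dissimilarity, beta=1.0, gamma=0.1):
    # Standard ELBO terms plus the distance-matching penalty.
    return recon_loss + beta * kl_loss + gamma * perceptual_regularizer(z, dissimilarity)

# Toy usage with random tensors standing in for encoder outputs and ratings.
z = torch.randn(4, 8)
dis = torch.rand(4, 4)
dis = (dis + dis.T) / 2
dis.fill_diagonal_(0)
print(total_loss(torch.tensor(1.0), torch.tensor(0.2), z, dis))
```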
Solid State Bus-Comp: A Large-Scale and Diverse Dataset for Dynamic Range Compressor Virtual Analog Modeling
Virtual Analog (VA) modeling aims to simulate the behavior of hardware circuits via algorithms to replicate their tone digitally. A Dynamic Range Compressor (DRC) is an audio processing module that controls the dynamics of a track by reducing and amplifying the volumes of loud and quiet sounds, which is essential in music production. In recent years, neural-network-based VA modeling has shown great potential in producing high-fidelity models. However, due to the lack of data quantity and diversity, their generalization ability across different parameter settings and input sounds is still limited. To tackle this problem, we present Solid State Bus-Comp, the first large-scale and diverse dataset for modeling the classical VCA compressor, the SSL 500 G-Bus. Specifically, we manually collected 175 unmastered songs from the Cambridge Multitrack Library. We recorded the compressed audio in 220 parameter combinations, resulting in an extensive 2528-hour dataset with diverse genres, instruments, tempos, and keys. Moreover, to facilitate the use of our proposed dataset, we conducted benchmark experiments on various open-source black-box and grey-box models, as well as white-box plugins. We also conducted ablation studies on different data subsets to illustrate the effectiveness of the improved data diversity and quantity. The dataset and demos are on our project page: https://www.yichenggu.com/SolidStateBusComp/.
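As a side note, a metric commonly used when benchmarking neural virtual-analog models is the error-to-signal ratio (ESR) between the model output and the hardware-processed reference; the sketch below is generic and not necessarily the metric used in the paper's benchmarks.

```python
# Error-to-signal ratio (ESR): residual energy relative to target energy.
import numpy as np

def esr(target, prediction, eps=1e-10):
    target = np.asarray(target, dtype=float)
    prediction = np.asarray(prediction, dtype=float)
    return float(np.sum((target - prediction) ** 2) / (np.sum(target ** 2) + eps))

# Toy example: a slightly mis-gained copy of the reference signal.
ref = np.sin(np.linspace(0.0, 100.0, 48000))
print(esr(ref, 0.95 * ref))
```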
Improved Automatic Instrumentation Role Classification and Loop Activation Transcription
Many electronic music (EM) genres are composed through the activation of short audio recordings of instruments designed for seamless repetition, known as loops. In this work, loops of key structural groups such as bass, percussive or melodic elements are labelled by the role they occupy in a piece of music through the task of automatic instrumentation role classification (AIRC). Such labels assist EM producers in the identification of compatible loops in large unstructured audio databases. While human annotation is often laborious, automatic classification allows for fast and scalable generation of these labels. We experiment with several deep-learning architectures and propose a data augmentation method for improving multi-label representation to balance classes within the Freesound Loop Dataset. To improve the classification accuracy of the architectures, we also evaluate different pooling operations. Results indicate that, in combination with the data augmentation and pooling strategies, the proposed system achieves state-of-the-art performance for AIRC. Additionally, we demonstrate how our proposed AIRC method is useful for analysing the structure of EM compositions through loop activation transcription.
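As an illustration of the pooling comparison mentioned above (not the paper's code), the sketch below collapses frame-level embeddings into a single clip-level vector using mean, max, or concatenated mean-and-max pooling before a classification layer would be applied.

```python
# Temporal pooling of frame-level embeddings into one clip-level vector.
import numpy as np

def pool(frame_embeddings, mode="mean"):
    """frame_embeddings: (time, features) array from a convolutional front end."""
    x = np.asarray(frame_embeddings, dtype=float)
    if mode == "mean":
        return x.mean(axis=0)
    if mode == "max":
        return x.max(axis=0)
    if mode == "mean+max":                    # concatenate both statistics
        return np.concatenate([x.mean(axis=0), x.max(axis=0)])
    raise ValueError(mode)

emb = np.random.randn(128, 64)                # 128 frames, 64-dimensional embeddings
print(pool(emb, "mean").shape, pool(emb, "mean+max").shape)
```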
Distortion Recovery: A Two-Stage Method for Guitar Effect Removal
Removing audio effects from electric guitar recordings makes post-production and sound editing easier. An audio distortion recovery model not only improves the clarity of the guitar sounds but also opens up new opportunities for creative adjustments in mixing and mastering. While progress has been made in creating such models, previous efforts have largely focused on synthetic distortions that may be too simplistic to accurately capture the complexities seen in real-world recordings. In this paper, we tackle the task by using a dataset of guitar recordings rendered with commercial-grade audio effect VST plugins. Moreover, we introduce a novel two-stage methodology for audio distortion recovery. The idea is to process the audio signal in the Mel-spectrogram domain in the first stage, and then use a neural vocoder to generate the pristine original guitar sound from the processed Mel-spectrogram in the second stage. We report a set of experiments demonstrating the effectiveness of our approach over existing methods through both subjective and objective evaluation metrics.
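A hedged sketch of the two-stage structure described above; the Mel-domain model and the vocoder are stand-ins (an identity mapping and Griffin-Lim inversion), not the trained networks used in the paper.

```python
# Two-stage skeleton: Mel-domain processing, then waveform reconstruction.
import numpy as np
import librosa

def recover(distorted, sr=44100):
    # Stage 1: work in the Mel-spectrogram domain.
    mel = librosa.feature.melspectrogram(y=distorted, sr=sr, n_mels=128)
    log_mel = np.log(mel + 1e-6)
    clean_log_mel = stage1_model(log_mel)     # placeholder for the learned Mel-to-Mel mapping
    # Stage 2: a vocoder synthesizes the clean waveform from the processed Mel output.
    return vocoder(clean_log_mel, sr)

def stage1_model(log_mel):
    return log_mel                            # identity stand-in for the trained network

def vocoder(log_mel, sr=44100):
    # Stand-in using Griffin-Lim style inversion instead of a neural vocoder.
    mel = np.exp(log_mel) - 1e-6
    return librosa.feature.inverse.mel_to_audio(mel, sr=sr)

y = np.random.randn(44100).astype(np.float32)  # toy 1-second "distorted" signal
print(recover(y).shape)
```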
Classification Of Music Signals In The Visual Domain
With the huge increase in the availability of digital music, it has become more important to automate the task of querying a database of musical pieces. At the same time, a computational solution to this task might give us an insight into how humans perceive and classify music. In this paper, we discuss our attempts to classify music into three broad categories: rock, classical and jazz. We discuss the feature extraction process and the particular choice of features that we used: spectrograms and mel-scaled cepstral coefficients (MFCCs). We use texture-of-texture models to generate feature vectors from these. Together, these features are capable of capturing the frequency-power profile of the sound as the song proceeds. Finally, we attempt to classify the generated data using a variety of classifiers, and we discuss our results and the inferences that can be drawn from them.
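A minimal sketch in the spirit of the described pipeline (not the paper's code): MFCCs are extracted per frame, summarized by clip-level statistics, and fed to a simple classifier. The toy synthetic signals and the choice of classifier are assumptions made for illustration.

```python
# MFCC-based clip features and a toy three-genre classifier.
import numpy as np
import librosa
from sklearn.svm import SVC

def clip_features(y, sr):
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)          # (13, frames)
    # Summarize each coefficient's trajectory by its mean and standard deviation.
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

sr = 22050
# Stand-in excerpts (in practice, audio excerpts loaded from each genre).
excerpts = {
    "rock": np.sign(np.sin(2 * np.pi * 110 * np.arange(sr) / sr)),  # square-ish wave
    "classical": np.sin(2 * np.pi * 440 * np.arange(sr) / sr),      # pure tone
    "jazz": np.random.randn(sr) * 0.1,                              # noise-like signal
}
X = np.array([clip_features(y.astype(float), sr) for y in excerpts.values()])
clf = SVC(kernel="rbf").fit(X, list(excerpts.keys()))
print(clf.predict(X[:1]))
```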
Audio-Visual Talker Localization in Video for Spatial Sound Reproduction
Object-based audio production requires positional metadata to be defined for each point-source object, including the key elements in the foreground of the sound scene. In many media production use cases, both cameras and microphones are employed to make recordings, and the human voice is often a key element. In this research, we detect and locate the active speaker in the video, facilitating the automatic extraction of the positional metadata of the talker relative to the camera’s reference frame. With the integration of the visual modality, this study expands upon our previous investigation focused solely on audio-based active speaker detection and localization. Our experiments compare conventional audio-visual approaches for active speaker detection that leverage monaural audio, our previous audio-only method that leverages multichannel recordings from a microphone array, and a novel audio-visual approach integrating vision and multichannel audio. We found that the two modalities complement each other. Multichannel audio, overcoming the problem of visual occlusions, provides a double-digit reduction in detection error compared to audio-visual methods with single-channel audio. The combination of multichannel audio and vision further enhances spatial accuracy, leading to a four-percentage-point increase in F1 score on the Tragic Talkers dataset. Future investigations will assess the robustness of the model in noisy and highly reverberant environments, as well as tackle the problem of off-screen speakers.
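Illustration only, not the paper's method: a naive late fusion of per-frame speaker-activity scores from an audio branch and a visual branch, evaluated with the frame-level F1 score. The scores, labels, and fusion weight below are made up for the example.

```python
# Naive weighted late fusion of audio and visual activity scores, scored with F1.
import numpy as np
from sklearn.metrics import f1_score

def fuse(audio_scores, visual_scores, w_audio=0.5, threshold=0.5):
    """Weighted average of the two modality scores, then a binary decision."""
    fused = w_audio * np.asarray(audio_scores) + (1 - w_audio) * np.asarray(visual_scores)
    return (fused >= threshold).astype(int)

labels = np.array([1, 1, 0, 0, 1, 0, 1, 1])            # hypothetical ground truth
audio  = np.array([0.9, 0.7, 0.2, 0.4, 0.6, 0.1, 0.8, 0.5])
visual = np.array([0.8, 0.6, 0.3, 0.2, 0.2, 0.3, 0.9, 0.7])
print("F1 =", f1_score(labels, fuse(audio, visual)))
```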
Differentiable MIMO Feedback Delay Networks for Multichannel Room Impulse Response Modeling
Recently, with the advent of new high-performing headsets and goggles, the demand for Virtual and Augmented Reality applications has experienced a steep increase. In order to coherently navigate virtual rooms, the acoustics of the scene must be emulated in the most accurate and efficient way possible. Amongst others, Feedback Delay Networks (FDNs) have proved to be valuable tools for tackling such a task. In this article, we expand and adapt a method recently proposed for the data-driven optimization of single-input-single-output FDNs to the multiple-input-multiple-output (MIMO) case, addressing spatial/space-time processing applications. By testing our methodology on items taken from two different datasets, we show that the parameters of MIMO FDNs can be jointly optimized to match some perceptual characteristics of given multichannel room impulse responses, outperforming approaches available in the literature and paving the way toward increasingly efficient and accurate real-time virtual room acoustics rendering.
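A hedged sketch of a small MIMO feedback delay network: each input is injected into a set of delay lines, the delay-line outputs are recirculated through a feedback matrix, and mixed to the outputs. The delay lengths, gains, and matrices below are illustrative defaults, not the optimized parameters of the paper.

```python
# Minimal time-domain MIMO FDN recursion.
import numpy as np

def mimo_fdn(x, delays, A, B, C, D):
    """x: (n_samples, n_in) input. Returns (n_samples, n_out) output."""
    n, _ = x.shape
    n_lines = len(delays)
    buffers = [np.zeros(d) for d in delays]       # one circular buffer per delay line
    ptr = np.zeros(n_lines, dtype=int)
    y = np.zeros((n, C.shape[0]))
    for t in range(n):
        s = np.array([buffers[i][ptr[i]] for i in range(n_lines)])  # delay-line outputs
        y[t] = C @ s + D @ x[t]                   # output mixing plus direct path
        new = A @ s + B @ x[t]                    # feedback matrix plus input injection
        for i in range(n_lines):
            buffers[i][ptr[i]] = new[i]           # write back into the delay line
            ptr[i] = (ptr[i] + 1) % delays[i]
    return y

# Toy 2-in / 2-out FDN with 4 delay lines and a lossy orthogonal feedback matrix.
rng = np.random.default_rng(0)
delays = [149, 211, 263, 293]
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
A = 0.97 * Q                                      # decay applied to an orthogonal matrix
B = rng.standard_normal((4, 2)) * 0.5
C = rng.standard_normal((2, 4)) * 0.5
D = np.zeros((2, 2))
x = np.zeros((2000, 2))
x[0, 0] = 1.0                                     # impulse at the first input
print(mimo_fdn(x, delays, A, B, C, D).shape)
```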