Dimensionality Reduction Techniques for Fear Emotion Detection from Speech
In this paper, we propose to reduce the relatively high dimensionality of pitch-based features for fear emotion recognition from speech. To do so, the K-nearest neighbors algorithm is used to classify three emotion classes: fear, neutral, and 'other emotions'. Several dimensionality reduction techniques are explored. First, the optimal features ensuring the best emotion classification are determined. Next, several families of dimensionality reduction methods, namely PCA, LDA, and LPP, are tested in order to reveal the dimension range guaranteeing the highest overall and fear recognition rates. Results show that the optimal feature group yields an overall accuracy of 93.34% and a fear recognition rate of 78.7%. Among the dimensionality reduction methods, Principal Component Analysis (PCA) gives the best results: 92% overall accuracy and 93.3% fear recognition.
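A minimal scikit-learn sketch of this kind of pipeline, assuming pitch-based feature vectors already extracted per utterance; the random data, number of neighbors, and dimension grid below are placeholders, not the paper's settings.

```python
# Sketch: PCA dimensionality reduction followed by K-nearest-neighbors
# classification of fear / neutral / other, sweeping the reduced dimension.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 40))                       # placeholder pitch-based features
y = rng.choice(["fear", "neutral", "other"], size=300)

# Sweep the reduced dimension to locate the range giving the best accuracy.
for n_comp in (5, 10, 20):
    clf = make_pipeline(PCA(n_components=n_comp), KNeighborsClassifier(n_neighbors=5))
    score = cross_val_score(clf, X, y, cv=5).mean()
    print(f"PCA dim {n_comp}: accuracy {score:.3f}")
```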
Subjective Evaluation of Sound Quality and Control of Drum Synthesis with StyleWaveGAN
In this paper we investigate the perceptual properties of StyleWaveGAN, a drum synthesis method proposed in a previous publication. For both sound quality and control precision, StyleWaveGAN has been shown to deliver state-of-the-art performance on quantitative metrics (FAD and MSE of the control parameters). The present paper aims to provide insight into the perceptual relevance of these results. Accordingly, we performed a subjective evaluation of the sound quality as well as a subjective evaluation of the precision of the control, using timbre descriptors from the AudioCommons toolbox. We evaluate the sound quality with mean opinion scores and measure the psychophysical response to variations of the control. By means of these perceptual tests, we demonstrate that StyleWaveGAN produces better sound quality than the state-of-the-art model DrumGAN, and that the mean control error is lower than the absolute threshold of perception at every point of measurement used in the experiment.
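As a worked example of the evaluation side, here is a small sketch of computing a mean opinion score (MOS) with a 95% confidence interval from listener ratings; the ratings below are illustrative placeholders, not data from the paper.

```python
# Sketch: MOS and its 95% confidence interval from per-listener ratings.
import numpy as np
from scipy import stats

ratings = np.array([4, 5, 3, 4, 4, 5, 4, 3, 5, 4])   # 1-5 scale, one per listener
mos = ratings.mean()
ci = stats.t.interval(0.95, df=len(ratings) - 1,
                      loc=mos, scale=stats.sem(ratings))
print(f"MOS = {mos:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```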
Improved Automatic Instrumentation Role Classification and Loop Activation Transcription
Many electronic music (EM) genres are composed through the activation of short audio recordings of instruments designed for seamless repetition, or loops. In this work, loops of key structural groups such as bass, percussive, or melodic elements are labelled by the role they occupy in a piece of music through the task of automatic instrumentation role classification (AIRC). Such labels assist EM producers in identifying compatible loops in large unstructured audio databases. Whereas human annotation is laborious, automatic classification allows for fast and scalable generation of these labels. We experiment with several deep learning architectures and propose a data augmentation method for improving multi-label representation to balance classes within the Freesound Loop Dataset. To improve the classification accuracy of the architectures, we also evaluate different pooling operations. Results indicate that, in combination with the data augmentation and pooling strategies, the proposed system achieves state-of-the-art performance for AIRC. Additionally, we demonstrate how the proposed AIRC method is useful for analysing the structure of EM compositions through loop activation transcription.
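A minimal PyTorch sketch of a multi-label role classifier with a configurable global pooling operation, since the pooling choice is one of the design decisions evaluated; the architecture, input shape, and role list are illustrative assumptions, not the paper's configuration.

```python
# Sketch: multi-label instrumentation-role classifier over loop spectrograms.
import torch
import torch.nn as nn

ROLES = ["bass", "percussion", "melody", "chords", "fx"]   # assumed label set

class AIRCNet(nn.Module):
    def __init__(self, pooling="max"):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
        )
        # Global pooling variant: one of the operations compared in the paper.
        self.pool = (nn.AdaptiveMaxPool2d(1) if pooling == "max"
                     else nn.AdaptiveAvgPool2d(1))
        self.head = nn.Linear(32, len(ROLES))

    def forward(self, spec):                    # spec: (batch, 1, mels, frames)
        h = self.pool(self.conv(spec)).flatten(1)
        return self.head(h)                     # one logit per role

model = AIRCNet(pooling="max")
logits = model(torch.randn(4, 1, 64, 128))
loss = nn.BCEWithLogitsLoss()(logits, torch.randint(0, 2, (4, 5)).float())
```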
Maximum Filter Vibrato Suppression for Onset Detection
We present SuperFlux, a new onset detection algorithm with vibrato suppression. It is an enhanced version of the widely used spectral flux onset detection algorithm and reduces the number of false positive detections considerably by tracking spectral trajectories with a maximum filter. Especially for music with heavy use of vibrato (e.g., sung operas or string performances), the number of false positive detections can be reduced by up to 60% without missing any additional events. Algorithm performance was evaluated and compared to state-of-the-art methods on three different datasets comprising mixed audio material (25,927 onsets), violin recordings (7,677 onsets), and operatic solo voice recordings (1,448 onsets). Due to its causal nature, the algorithm is applicable in both offline and online real-time scenarios.
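A numpy/scipy sketch of a maximum-filtered spectral flux detection function in the spirit of SuperFlux; the filter width, lag, and STFT settings are illustrative, not the paper's exact parameters.

```python
# Sketch: spectral flux where the reference frame is max-filtered along
# frequency, so frequency-modulated (vibrato) trajectories do not produce
# spurious positive flux.
import numpy as np
from scipy.ndimage import maximum_filter

def superflux_odf(mag, max_width=3, lag=1):
    """mag: magnitude spectrogram, shape (frames, bins)."""
    log_mag = np.log1p(mag)
    # Widen each spectral trajectory along the frequency axis.
    widened = maximum_filter(log_mag, size=(1, max_width))
    diff = log_mag[lag:] - widened[:-lag]
    return np.maximum(diff, 0.0).sum(axis=1)    # onset detection function

odf = superflux_odf(np.abs(np.random.rand(200, 1025)))  # placeholder input
```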
Unsupervised Feature Learning for Speech and Music Detection in Radio Broadcasts
Detecting speech and music is an elementary step in extracting information from radio broadcasts. Existing solutions either rely on general-purpose audio features or build on features specifically engineered for the task. Interpreting spectrograms as images, we can instead apply unsupervised feature learning methods from computer vision. In this work, we show that features learned by a mean-covariance Restricted Boltzmann Machine partly resemble engineered features, but outperform three hand-crafted feature sets in speech and music detection on a large corpus of radio recordings. Our results demonstrate that unsupervised learning is a powerful alternative to knowledge engineering.
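A simplified sketch of the feature-learning step, using scikit-learn's plain Bernoulli RBM as a stand-in for the paper's mean-covariance RBM (which additionally models covariance structure); the patch shape and hyperparameters are placeholders.

```python
# Sketch: unsupervised feature learning from flattened spectrogram excerpts.
import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.preprocessing import minmax_scale

# Placeholder: rows are short spectrogram excerpts flattened to vectors.
patches = minmax_scale(np.random.rand(1000, 15 * 40))   # 15 frames x 40 mel bands

rbm = BernoulliRBM(n_components=64, learning_rate=0.05, n_iter=10)
features = rbm.fit_transform(patches)   # hidden-unit activations as features
# 'features' would then feed a downstream speech/music classifier.
```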
An Extension for Source Separation Techniques Avoiding Beats
The problem of separating individual sound sources from a mixture, known as Source Separation or Computational Auditory Scene Analysis (CASA), has become popular in recent decades. A number of methods have emerged from the study of this problem, some of which perform very well for certain types of audio sources, e.g. speech. For the separation of instruments in music, however, several shortcomings remain. In general, when instruments play together they are not independent of each other; more specifically, the time-frequency distributions of the different sources will overlap. Harmonic instruments in particular have a high probability of overlapping partials. If these overlapping partials are not separated properly, the separated signals will exhibit a different sensation of roughness, and the separation quality degrades. In this paper we present a method to separate overlapping partials in stereo signals. The method examines the shapes of partial envelopes and demixes overlapping partials by minimizing the difference between these shapes. It can be applied to enhance existing methods for source separation, e.g. blind source separation techniques, model-based techniques, and spatial separation techniques. We also discuss other, simpler methods that can work with mono signals.
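A least-squares sketch of the envelope-shape idea: the envelope of an overlapping partial is modelled as a weighted sum of reference envelope shapes taken from non-overlapping partials of the same sources, and the weights are found by minimizing the residual. The envelopes below are synthetic placeholders, not the paper's estimator.

```python
# Sketch: demixing one overlapping partial from two reference envelope shapes.
import numpy as np

t = np.linspace(0, 1, 200)
ref1 = np.exp(-3 * t)                        # shape from a clean partial, source 1
ref2 = np.exp(-1 * t) * np.sin(np.pi * t)    # shape from a clean partial, source 2
mixed = 0.7 * ref1 + 0.4 * ref2              # observed overlapping-partial envelope

# Minimize || mixed - (a1*ref1 + a2*ref2) ||^2 over the weights a1, a2.
A = np.column_stack([ref1, ref2])
(a1, a2), *_ = np.linalg.lstsq(A, mixed, rcond=None)
demixed1, demixed2 = a1 * ref1, a2 * ref2    # per-source partial envelopes
```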
Sound Source Separation: Preprocessing for Hearing Aids and Structured Audio Coding
In this paper we consider the problem of separating different sound sources in multichannel audio signals. Different approaches to the problem of Blind Source Separation (BSS), e.g. the Independent Component Analysis (ICA) originally proposed by Herault and Jutten, and extensions that include delays, work well for artificially mixed signals. However, the quality of the separated signals is severely degraded for real sound recordings in the presence of reverberation. We consider a system with two sources and two sensors, and show how the quality of the separation can be improved with a simple model of the audio scene. More specifically, we estimate the delays between the sensor signals and put constraints on the deconvolution coefficients.
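A small sketch of the delay-estimation step: the lag between the two sensor signals is read off the peak of their cross-correlation. The signals below are synthetic placeholders; the paper's estimator may differ in detail.

```python
# Sketch: estimating the inter-sensor delay via cross-correlation.
import numpy as np
from scipy.signal import correlate, correlation_lags

fs = 16000
src = np.random.randn(fs)
true_delay = 12                                          # samples
mic1 = src
mic2 = np.concatenate([np.zeros(true_delay), src])[:fs]  # delayed copy

xcorr = correlate(mic2, mic1, mode="full")
lags = correlation_lags(len(mic2), len(mic1), mode="full")
est_delay = lags[np.argmax(xcorr)]
print(f"estimated delay: {est_delay} samples")           # ~12
```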
Realistic Gramophone Noise Synthesis Using a Diffusion Model
This paper introduces a novel data-driven strategy for synthesizing gramophone noise audio textures. A diffusion probabilistic model is applied to generate highly realistic quasi-periodic noises. The proposed model is designed to generate samples of length equal to one disk revolution, but a method to generate plausible periodic variations between revolutions is also proposed. A guided approach is further applied as a conditioning method, where an audio signal generated with manually tuned signal processing is refined via reverse diffusion to improve realism. The method has been evaluated in a subjective listening test, in which the participants were often unable to distinguish the synthesized signals from the real ones. The synthetic noises produced with the best proposed unconditional method are statistically indistinguishable from real noise recordings. This work shows the potential of diffusion models for highly realistic audio synthesis tasks.
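For orientation, a minimal sketch of one reverse-diffusion (denoising) step in the standard DDPM formulation; the noise schedule, sample length, and dummy noise predictor are placeholders, not the paper's trained model.

```python
# Sketch: one DDPM reverse step; iterating from t=T-1 down to 0 turns
# Gaussian noise into a sample (here, one revolution's worth of audio).
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)

def reverse_step(x_t, t, eps_model):
    """Sample x_{t-1} given x_t using the predicted noise eps_model(x_t, t)."""
    eps = eps_model(x_t, t)
    mean = (x_t - betas[t] * eps / torch.sqrt(1.0 - alpha_bar[t])) / torch.sqrt(alphas[t])
    if t == 0:
        return mean
    return mean + torch.sqrt(betas[t]) * torch.randn_like(x_t)

# Toy usage with a dummy noise predictor (a real model would be a neural net).
x = torch.randn(1, 4410)                      # placeholder one-revolution length
for t in reversed(range(T)):
    x = reverse_step(x, t, lambda x_t, t: torch.zeros_like(x_t))
```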
Joint Estimation of Fader and Equalizer Gains of DJ Mixers Using Convex Optimization
Disc jockeys (DJs) use audio effects to make smooth transitions from one song to another. There have been attempts to computationally analyze the creative process of seamless mixing, but only a few studies have estimated the fader or equalizer (EQ) gains controlled by DJs. In this study, we propose a method that jointly estimates time-varying fader and EQ gains so as to reproduce the mix from the individual source tracks. The method approximates each equalizer filter with a linear combination of a fixed equalizer filter and a constant gain, which converts the joint estimation into a convex optimization problem. For the experiment, we collected a new DJ mix dataset consisting of 5,040 real-world DJ mixes with 50,742 transitions, and evaluated the proposed method with a mix reconstruction error. The results show that the proposed method estimates the time-varying fader and equalizer gains more accurately than existing methods and simple baselines.
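A toy cvxpy sketch of the convex formulation for a single frame and a single source track: the mix magnitude is modelled as a nonnegative combination of the track and a fixed-EQ-filtered version of it. The data, filter, and per-frame scope are illustrative assumptions, not the paper's full model.

```python
# Sketch: per-frame fader/EQ gain estimation as convex least squares.
import cvxpy as cp
import numpy as np

bins = 64
track = np.abs(np.random.randn(bins))             # |X| of one source, one frame
eq_track = track * np.linspace(0.2, 1.0, bins)    # fixed EQ filter applied
mix = 0.8 * track + 0.3 * eq_track                # observed mix frame (synthetic)

g_fader = cp.Variable(nonneg=True)                # fader gain for this frame
g_eq = cp.Variable(nonneg=True)                   # gain of the fixed-EQ component
resid = mix - (g_fader * track + g_eq * eq_track)
cp.Problem(cp.Minimize(cp.sum_squares(resid))).solve()
print(g_fader.value, g_eq.value)                  # ~0.8, ~0.3
```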
Notes on the Use of Variational Autoencoders for Speech and Audio Spectrogram Modeling
Variational autoencoders (VAEs) are powerful (deep) generative artificial neural networks. They have recently been used in several papers for speech and audio processing, in particular for the modeling of speech/audio spectrograms. In these papers, little theoretical justification is given for the chosen data representation and decoder likelihood function, or for the corresponding cost function used for training the VAE. Yet a well-established statistical framework exists and has been extensively presented and discussed in papers dealing with nonnegative matrix factorization (NMF) of audio spectrograms and its application to audio source separation. In the present paper, we show how this statistical framework applies to VAE-based speech/audio spectrogram modeling. This provides insights into the choice and interpretability of the data representation and model parameterization.
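A small sketch of the link this framework formalizes: if the decoder outputs the variance of a zero-mean complex Gaussian model of the STFT coefficients, the negative log-likelihood of the power spectrogram reduces, up to constants, to the Itakura-Saito divergence familiar from IS-NMF. The tensors below are placeholders standing in for a real encoder/decoder.

```python
# Sketch: Itakura-Saito reconstruction term for a VAE over power spectrograms.
import torch

def is_divergence(power, var):
    """Itakura-Saito divergence between power spectrogram and model variance."""
    ratio = power / var
    return (ratio - torch.log(ratio) - 1.0).sum()

power = torch.rand(8, 513) + 1e-3       # |STFT|^2 frames (placeholder data)
var = torch.rand(8, 513) + 1e-3         # decoder output (placeholder)
recon_loss = is_divergence(power, var)  # replaces the usual MSE term in the VAE loss
```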