Unsupervised Feature Learning for Speech and Music Detection in Radio Broadcasts
Detecting speech and music is an elementary step in extracting information from radio broadcasts. Existing solutions either rely on general-purpose audio features, or build on features specifically engineered for the task. Interpreting spectrograms as images, we can apply unsupervised feature learning methods from computer vision instead. In this work, we show that features learned by a mean-covariance Restricted Boltzmann Machine partly resemble engineered features, but outperform three hand-crafted feature sets in speech and music detection on a large corpus of radio recordings. Our results demonstrate that unsupervised learning is a powerful alternative to knowledge engineering.
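To make the feature-learning step concrete, the sketch below trains a plain Bernoulli RBM from scikit-learn on spectrogram patches. This is only a simplified stand-in for the mean-covariance RBM used in the paper, and the file name, patch size and hyperparameters are placeholder assumptions.

```python
# Minimal sketch: unsupervised feature learning on spectrogram patches.
# The paper uses a mean-covariance RBM; scikit-learn only ships a plain
# Bernoulli RBM, which serves here as a simplified stand-in.
import numpy as np
import librosa
from sklearn.neural_network import BernoulliRBM
from sklearn.preprocessing import minmax_scale

# Load any audio file (path is a placeholder) and compute a mel spectrogram.
y, sr = librosa.load("broadcast.wav", sr=22050)   # hypothetical file
S = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40))

# Cut the spectrogram into short patches (40 mel bands x 15 frames).
patch_len = 15
patches = np.stack([S[:, i:i + patch_len].ravel()
                    for i in range(0, S.shape[1] - patch_len, patch_len)])
patches = minmax_scale(patches)          # RBM expects values in [0, 1]

# Learn a dictionary of spectro-temporal features without any labels.
rbm = BernoulliRBM(n_components=64, learning_rate=0.05, n_iter=20)
features = rbm.fit_transform(patches)    # hidden-unit activations per patch
print(features.shape)                    # (n_patches, 64) learned features
```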
Music Emotion Classification: Dataset Acquisition And Comparative Analysis
In this paper we present an approach to emotion classification in audio music. The process is conducted with a dataset of 903 clips and mood labels, collected from the Allmusic database and organized in five clusters similar to the dataset used in the MIREX Mood Classification Task. Three different audio frameworks (Marsyas, MIR Toolbox and Psysound) were used to extract several features. These audio features and annotations are used with supervised learning techniques to train and test various classifiers based on support vector machines. To assess the importance of each feature, several different combinations of features, obtained with feature selection algorithms or selected manually, were tested. The performance of the solution was measured with 20 repetitions of 10-fold cross-validation, achieving an F-measure of 47.2% with precision of 46.8% and recall of 47.6%.
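The evaluation protocol can be illustrated with a short scikit-learn sketch: an SVM classifier scored with 20 repetitions of 10-fold cross-validation. The feature matrix below is synthetic; in the paper the features come from Marsyas, the MIR Toolbox and Psysound.

```python
# Minimal sketch of the evaluation setup: SVM classifiers scored with
# repeated 10-fold cross-validation (feature values below are synthetic).
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(903, 40))            # 903 clips x 40 stand-in features
y = rng.integers(0, 5, size=903)          # five mood clusters

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=20, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring="f1_macro")
print(f"macro F-measure: {scores.mean():.3f} +/- {scores.std():.3f}")
```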
A jump start for NMF with N-FINDR and NNLS
Nonnegative Matrix Factorization is a popular tool for the analysis of audio spectrograms. It is usually initialized with random data, after which it iteratively converges to a local optimum. In this paper we show that N-FINDR and NNLS, popular techniques for dictionary and activation matrix learning in remote sensing, prove useful for creating a better starting point for NMF. This reduces the number of iterations needed to reach a decomposition of similar quality. Adapting algorithms from the hyperspectral image unmixing and remote sensing communities provides an interesting direction for future research in audio spectrogram factorization.
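A minimal sketch of the warm-start idea, under simplifying assumptions: a crude greedy column selection stands in for N-FINDR, NNLS supplies the initial activations, and both matrices are handed to scikit-learn's NMF as a custom initialization. The file name and dimensions are placeholders.

```python
# Minimal sketch: warm-starting NMF on a magnitude spectrogram.  A greedy
# column selection stands in for N-FINDR, and scipy's NNLS solver computes
# the initial activations.
import numpy as np
import librosa
from scipy.optimize import nnls
from sklearn.decomposition import NMF

y, sr = librosa.load("mixture.wav", sr=22050)                 # hypothetical input
V = np.abs(librosa.stft(y, n_fft=1024)).astype(np.float64)    # (freq, time)
k = 8

# Greedy stand-in for N-FINDR: pick frames poorly explained by those chosen so far.
W = np.empty((V.shape[0], k))
residual = V.copy()
for j in range(k):
    idx = np.argmax(np.linalg.norm(residual, axis=0))
    W[:, j] = V[:, idx]
    coeffs, _, _, _ = np.linalg.lstsq(W[:, :j + 1], V, rcond=None)
    residual = V - W[:, :j + 1] @ coeffs

# NNLS gives non-negative activations for the chosen dictionary.
H = np.stack([nnls(W, V[:, t])[0] for t in range(V.shape[1])], axis=1)

# Hand both matrices to scikit-learn's NMF as the starting point.
model = NMF(n_components=k, init="custom", max_iter=50)
W_fin = model.fit_transform(V, W=W, H=H)
print(model.reconstruction_err_)
```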
A Simple and Effective Spectral Feature for Speech Detection in Mixed Audio Signals
We present a simple and intuitive spectral feature for detecting the presence of spoken speech in mixed (speech, music, arbitrary sounds and noises) audio signals. The feature is based on simple observations about how harmonics with characteristic trajectories appear in signals that contain speech. Experiments with some 70 hours of radio broadcasts in five different languages demonstrate that the feature is very effective in detecting and delineating segments that contain speech, and that it also appears to be quite general and robust with respect to different languages.
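The sketch below illustrates the underlying intuition rather than the paper's exact feature: it tracks the strongest low-frequency spectral peak per frame and measures how much it drifts over time, since voiced-speech harmonics wander while sustained musical tones tend to stay put. The threshold, band limits and file name are assumptions.

```python
# Crude illustration of the idea, not the paper's feature: measure how much
# the dominant low-frequency harmonic wanders from frame to frame.
import numpy as np
import librosa

y, sr = librosa.load("broadcast.wav", sr=16000)          # placeholder file
S = np.abs(librosa.stft(y, n_fft=512, hop_length=160))   # 10 ms hop
freqs = librosa.fft_frequencies(sr=sr, n_fft=512)

# Restrict to the range where voiced-speech harmonics live (~100-1000 Hz).
band = (freqs >= 100) & (freqs <= 1000)
peak_hz = freqs[band][np.argmax(S[band], axis=0)]        # strongest peak per frame

# Frame-to-frame movement of that peak, smoothed over ~0.5 s windows.
movement = np.abs(np.diff(peak_hz))
win = 50
traj = np.convolve(movement, np.ones(win) / win, mode="same")
speechy = traj > 20.0    # crude threshold in Hz; tune on labelled data
print(f"fraction of frames flagged as speech-like: {speechy.mean():.2f}")
```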
Voice Features For Control: A Vocalist Dependent Method For Noise Measurement And Independent Signals Computation
Information about the human spoken and singing voice is conveyed through the articulations of the individual’s vocal folds and vocal tract. The signal receiver, either human or machine, works at different levels of abstraction to extract and interpret only the relevant context-specific information needed. Traditionally in the field of human-machine interaction, the human voice is used to drive and control events that are discrete in terms of time and value. We propose to use the voice as a source of real-valued and time-continuous control signals that can be employed to interact with any multidimensional human-controllable device in real time. The isolation of noise sources and the independence of the control dimensions play a central role. Their dependence on the individual voice represents an additional challenge. In this paper we introduce a method to compute case-specific independent signals from the vocal sound, together with an individual study of feature computation and selection for noise rejection.
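As an illustration of the general idea (not the method proposed in the paper), the sketch below derives three real-valued, time-continuous control signals (loudness, pitch and brightness) from a vocal recording and normalises them for use as device parameters. The file name and control rate are assumptions.

```python
# Minimal sketch of the general idea: derive a few real-valued,
# time-continuous control signals from a vocal input.
import numpy as np
import librosa

y, sr = librosa.load("voice.wav", sr=16000)               # placeholder file
hop = 160                                                  # 10 ms control rate

# Three candidate control dimensions: loudness, pitch and brightness.
rms = librosa.feature.rms(y=y, hop_length=hop)[0]
f0, voiced, _ = librosa.pyin(y, fmin=80, fmax=800, sr=sr, hop_length=hop)
centroid = librosa.feature.spectral_centroid(y=y, sr=sr, hop_length=hop)[0]

# Normalise each to [0, 1] so it can drive arbitrary device parameters.
def to_control(x):
    x = np.nan_to_num(x, nan=0.0)                          # unvoiced frames -> 0
    return (x - x.min()) / (x.max() - x.min() + 1e-9)

controls = np.vstack([to_control(rms), to_control(f0), to_control(centroid)])
print(controls.shape)     # (3 control signals, n_frames at 100 Hz)
```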
Binaural In-Ear Monitoring of Acoustic Instruments in Live Music Performance
A method for Binaural In-Ear Monitoring (BIEM) of acoustic instruments in live music is presented. Spatial rendering is based on four considerations: the directional radiation patterns of musical instruments, room acoustics, binaural synthesis with Head-Related Transfer Functions (HRTF), and the movements of both the musician’s head and instrument. The concepts of static and dynamic sound mixes are presented and discussed in relation to the performers’ emotional involvement and musical instruments, as well as the use of motion capture technology. Pilot experiments on BIEM with dynamic mixing were carried out with amateur musicians performing with wireless headphones and a motion capture system in a small room. Listening tests, in which professional musicians evaluated recordings made under dynamic sound mixing, were conducted to gather initial reactions to BIEM. Ideas for further research in static sound mixing, individualized HRTFs, tracking techniques, as well as wedge-monitoring schemes are suggested.
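The binaural synthesis step alone can be sketched as follows: a close-miked instrument signal is convolved with the left and right HRIRs for the source direction currently reported by head tracking. The HRIR files, their naming scheme and the fixed azimuth below are hypothetical placeholders.

```python
# Minimal sketch of binaural synthesis only: convolve a (placeholder) dry
# instrument signal with left/right HRIRs for the tracked source direction.
import numpy as np
import soundfile as sf
from scipy.signal import fftconvolve

dry, sr = sf.read("violin_dry.wav")                # placeholder mono recording
azimuth = 30                                       # degrees, from head tracking

# Hypothetical HRIR database indexed by azimuth in 5-degree steps.
hrir = np.load(f"hrir_az{azimuth:03d}.npz")        # contains 'left' and 'right'
left = fftconvolve(dry, hrir["left"])
right = fftconvolve(dry, hrir["right"])

binaural = np.stack([left, right], axis=1)
binaural /= np.max(np.abs(binaural)) + 1e-9        # avoid clipping
sf.write("violin_binaural.wav", binaural, sr)
```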
Online Real-time Onset Detection with Recurrent Neural Networks
We present a new onset detection algorithm which operates online in real time without delay. Our method incorporates a recurrent neural network to model the sequence of onsets based solely on causal audio signal information. Comparative performance against existing state-of-the-art online and offline algorithms was evaluated using a very large database. The new method – despite being an online algorithm – shows performance only slightly short of the best existing offline methods while outperforming standard approaches.
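A minimal sketch of the online inference loop, under stated assumptions: a small recurrent layer with random placeholder weights is run strictly frame by frame over causal spectral features, and a crude causal threshold reports onsets with no look-ahead. In the paper the network is trained on a large annotated database.

```python
# Minimal sketch of online, causal onset detection with a recurrent layer.
# Weights are random placeholders; only the frame-by-frame loop is the point.
import numpy as np
import librosa

y, sr = librosa.load("music.wav", sr=44100)               # placeholder file
S = librosa.power_to_db(
    np.abs(librosa.stft(y, n_fft=2048, hop_length=441)) ** 2)   # 10 ms frames

n_in, n_hid = S.shape[0], 32
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.01, size=(n_hid, n_in))
W_rec = rng.normal(scale=0.01, size=(n_hid, n_hid))
w_out = rng.normal(scale=0.1, size=n_hid)

h = np.zeros(n_hid)
recent = []                                 # short history for causal peak-picking
for t in range(S.shape[1]):                 # strictly causal: one frame at a time
    h = np.tanh(W_in @ S[:, t] + W_rec @ h)
    act = 1.0 / (1.0 + np.exp(-(w_out @ h)))    # onset activation in [0, 1]
    if act > 0.5 and (not recent or act >= max(recent)):
        print(f"onset at {t * 0.01:.2f} s (activation {act:.2f})")
    recent = (recent + [act])[-3:]          # keep the last ~30 ms of activations
```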
Sparse Decomposition, Clustering and Noise for Fire Texture Sound Re-Synthesis
In this paper we introduce a framework that represents environmental texture sounds as a linear superposition of independent foreground and background layers that roughly correspond to entities in the physical production of the sound. Sound samples are decomposed into a sparse representation with the matching pursuit algorithm and a dictionary of Daubechies wavelet atoms. An agglomerative clustering procedure groups atoms into short transient molecules. A foreground layer is generated by sampling these sound molecules from a distribution, whose parameters are estimated from the input sample. The residual signal is modelled by an LPC-based source-filter model, synthesizing the background sound layer. The capability of the system is demonstrated with a set of fire sounds.
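The decomposition step can be sketched with a generic matching pursuit loop. To keep the example self-contained, a small dictionary of windowed sinusoids replaces the Daubechies wavelet atoms used in the paper, and the input signal is random placeholder data.

```python
# Minimal sketch of greedy matching pursuit against a unit-norm dictionary.
# The paper uses Daubechies wavelet atoms; windowed sinusoids stand in here.
import numpy as np

rng = np.random.default_rng(0)
N, atom_len = 4096, 256
signal = rng.normal(size=N)                      # stand-in for a fire recording

# Dictionary prototypes: Hann-windowed sinusoids at a few frequencies.
t = np.arange(atom_len)
window = np.hanning(atom_len)
protos = [window * np.sin(2 * np.pi * f * t / atom_len) for f in (2, 5, 11, 23)]
protos = [p / np.linalg.norm(p) for p in protos]

residual = signal.copy()
atoms = []           # (shift, prototype index, gain); the paper then clusters
for _ in range(50):                              # 50 greedy iterations
    best = (0.0, 0, 0)
    for k, p in enumerate(protos):
        corr = np.correlate(residual, p, mode="valid")   # all shifts at once
        shift = int(np.argmax(np.abs(corr)))
        if abs(corr[shift]) > abs(best[0]):
            best = (corr[shift], k, shift)
    gain, k, shift = best
    residual[shift:shift + atom_len] -= gain * protos[k]
    atoms.append((shift, k, gain))

print(f"energy explained: {1 - residual.var() / signal.var():.2%}")
```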
Digital Audio Effects on Mobile Platforms
This paper discusses the development of digital audio effect applications on mobile platforms. It introduces the Mobile Csound Platform (MCP) as an agile development kit for audio programming in such environments. The paper starts by exploring the basic technology employed: the Csound Application Programming Interface (API), the target systems (iOS and Android) and their support for realtime audio. CsoundObj, the fundamental class in the MCP toolkit, is introduced and explored in some detail. This is followed by a discussion of its implementation in Objective-C for iOS and Java for Android. A number of application scenarios are explored, and the paper concludes with a general discussion of the technology and its potential impact on audio effects development.
Characterisation of Acoustic Scenes Using a Temporally-constrained Shift-invariant Model
In this paper, we propose a method for modeling and classifying acoustic scenes using temporally-constrained shift-invariant probabilistic latent component analysis (SIPLCA). SIPLCA can be used for extracting time-frequency patches from spectrograms in an unsupervised manner. Component-wise hidden Markov models are incorporated into the SIPLCA formulation to enforce temporal constraints on the activation of each acoustic component. The time-frequency patches are converted to cepstral coefficients in order to provide a compact representation of acoustic events within a scene. Experiments are carried out using a corpus of train station recordings, classified into six scene classes. Results show that the proposed model is able to model salient events within a scene and outperforms the non-negative matrix factorization algorithm for the same task. In addition, it is demonstrated that the use of temporal constraints can lead to improved performance.
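The sketch below illustrates the comparison baseline rather than SIPLCA itself: spectral templates are learned from a scene recording with NMF, and each template is converted to cepstral coefficients by a DCT of its log magnitude, giving a compact per-component descriptor. The file name and component count are assumptions.

```python
# Minimal sketch of the NMF baseline plus cepstral conversion of the learned
# spectral templates (the paper's SIPLCA model adds shift-invariance and
# per-component HMM temporal constraints on top of this).
import numpy as np
import librosa
from scipy.fft import dct
from sklearn.decomposition import NMF

y, sr = librosa.load("station_scene.wav", sr=22050)        # hypothetical scene
V = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64).astype(np.float64)

model = NMF(n_components=10, init="nndsvd", max_iter=200)
W = model.fit_transform(V)             # (64 mel bands, 10 spectral templates)

templates = W.T                        # one spectral template per row
cepstra = dct(np.log(templates + 1e-9), type=2, axis=1, norm="ortho")[:, :13]
print(cepstra.shape)                   # (10 components, 13 cepstral coefficients)
```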