Online Real-time Onset Detection with Recurrent Neural Networks
We present a new onset detection algorithm which operates online in real time without delay. Our method incorporates a recurrent neural network to model the sequence of onsets based solely on causal audio signal information. We evaluated its performance against existing state-of-the-art online and offline algorithms on a very large database. Despite being an online algorithm, the new method falls only slightly short of the best existing offline methods while outperforming standard approaches.
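To make the online setting concrete, here is a minimal numpy sketch of such a causal, frame-by-frame inference loop. The network weights are random stand-ins (the paper's network is trained on annotated onset data), and the peak-picking threshold is an illustrative assumption:

```python
import numpy as np

# Minimal sketch of causal, frame-by-frame onset detection with a
# vanilla RNN. Weights are random here for illustration; in practice
# they would be trained on annotated onset data.
rng = np.random.default_rng(0)
n_in, n_hid = 48, 25                     # spectrogram bins, hidden units
W_ih = rng.normal(0, 0.1, (n_hid, n_in))
W_hh = rng.normal(0, 0.1, (n_hid, n_hid))
w_out = rng.normal(0, 0.1, n_hid)

def detect_online(frames, threshold=0.5):
    """Yield (frame_index, onset?) using only past and present frames."""
    h = np.zeros(n_hid)                  # recurrent state = causal memory
    prev = 0.0
    for t, x in enumerate(frames):
        h = np.tanh(W_ih @ x + W_hh @ h)
        p = 1.0 / (1.0 + np.exp(-w_out @ h))   # onset probability
        # simple causal peak picking: report a rising edge over threshold
        yield t, bool(p > threshold and p > prev)
        prev = p

# toy input: 100 random magnitude-spectrogram frames
frames = rng.random((100, n_in))
onsets = [t for t, is_onset in detect_online(frames) if is_onset]
```

Because the state update uses only the current frame and the previous state, each frame can be classified the moment it arrives, which is what makes the method delay-free.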
Unsupervised Feature Learning for Speech and Music Detection in Radio Broadcasts
Detecting speech and music is an elementary step in extracting information from radio broadcasts. Existing solutions either rely on general-purpose audio features or build on features specifically engineered for the task. Interpreting spectrograms as images, we can instead apply unsupervised feature learning methods from computer vision. In this work, we show that features learned by a mean-covariance Restricted Boltzmann Machine partly resemble engineered features, but outperform three hand-crafted feature sets in speech and music detection on a large corpus of radio recordings. Our results demonstrate that unsupervised learning is a powerful alternative to knowledge engineering.
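As a rough sketch of this kind of pipeline, the following uses scikit-learn's BernoulliRBM as a stand-in for the mean-covariance RBM (which scikit-learn does not provide), learning features from spectrogram excerpts and feeding them to a simple classifier; the data here is random placeholder input:

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Sketch of unsupervised feature learning on spectrogram excerpts.
# A Bernoulli RBM stands in for the mean-covariance RBM; its inputs
# must be scaled to [0, 1].
rng = np.random.default_rng(0)
X = rng.random((500, 40 * 15))   # 500 excerpts of 40 bins x 15 frames
y = rng.integers(0, 2, 500)      # toy labels: 0 = music, 1 = speech

model = Pipeline([
    ("rbm", BernoulliRBM(n_components=64, learning_rate=0.05,
                         n_iter=20, random_state=0)),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X, y)                  # RBM learns features unsupervised,
print(model.score(X, y))         # classifier uses its hidden activations
```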
A jump start for NMF with N-FINDR and NNLS
Nonnegative Matrix Factorization is a popular tool for the analysis of audio spectrograms. It is usually initialized with random data, after which it iteratively converges to a local optimum. In this paper we show that N-FINDR and NNLS, popular techniques for dictionary and activation matrix learning in remote sensing, prove useful for creating a better starting point for NMF. This reduces the number of iterations needed to reach a decomposition of similar quality. Adapting algorithms from the hyperspectral image unmixing and remote sensing communities provides an interesting direction for future research in audio spectrogram factorization.
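A minimal sketch of the idea, assuming a simple greedy column-selection heuristic in place of full N-FINDR (which maximizes simplex volume) and using scipy's NNLS solver to build the activation matrix before handing both to scikit-learn's NMF as a custom initialization:

```python
import numpy as np
from scipy.optimize import nnls
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
V = np.abs(rng.normal(size=(257, 400)))      # toy magnitude spectrogram
K = 8                                        # number of components

# Stand-in for N-FINDR: pick "extreme" spectrogram columns greedily by
# successive orthogonal projections.
W = np.empty((V.shape[0], K))
R = V.copy()
for k in range(K):
    j = np.argmax(np.linalg.norm(R, axis=0))  # most energetic residual column
    w = V[:, j]
    W[:, k] = w
    R -= np.outer(w, w @ R) / (w @ w)         # project columns off chosen atom

# NNLS gives nonnegative activations for the fixed dictionary W.
H = np.column_stack([nnls(W, V[:, t])[0] for t in range(V.shape[1])])

# Hand W, H to NMF as a custom starting point instead of a random init.
nmf = NMF(n_components=K, init="custom", max_iter=200)
W_fit = nmf.fit_transform(V, W=W.copy(), H=H.copy())
```

Since the starting point is already a plausible dictionary/activation pair rather than noise, fewer multiplicative update iterations are needed to reach a comparable decomposition.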
Music Emotion Classification: Dataset Acquisition And Comparative Analysis
In this paper we present an approach to emotion classification in audio music. The process is conducted with a dataset of 903 clips and mood labels, collected from the Allmusic database and organized in five clusters similar to the dataset used in the MIREX Mood Classification Task. Three different audio frameworks (Marsyas, MIR Toolbox and Psysound) were used to extract several features. These audio features and annotations are used with supervised learning techniques to train and test various classifiers based on support vector machines. To assess the importance of each feature, several different combinations of features, obtained with feature selection algorithms or selected manually, were tested. The performance of the solution was measured with 20 repetitions of 10-fold cross-validation, achieving an F-measure of 47.2% with a precision of 46.8% and a recall of 47.6%.
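The evaluation protocol is straightforward to reproduce; here is a sketch with scikit-learn, using random placeholder features in place of the extracted Marsyas/MIR Toolbox/Psysound features:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Sketch of the protocol: SVM classifier, 20 repetitions of 10-fold
# cross-validation, macro-averaged F-measure over 5 mood clusters.
rng = np.random.default_rng(0)
X = rng.normal(size=(903, 60))        # 903 clips x 60 placeholder features
y = rng.integers(0, 5, 903)           # 5 mood clusters (toy labels)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=20, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring="f1_macro")
print(f"F-measure: {scores.mean():.3f} +/- {scores.std():.3f}")
```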
Multi-channel Audio Information Hiding
We consider a method of hiding many audio channels in one host signal. The purpose is to provide a ‘mix’ that incorporates information on all the channels used to produce it, thereby allowing all, or at least some, channels to be stored in the mix for later use (e.g. for re-mixing and/or archiving). After providing an overview of some recently published audio watermarking schemes in the time and transform domains, we present a method based on a four-least-significant-bits scheme that embeds five MP3 files into a single 16-bit host WAV file without incurring any perceptual audio distortions in either the host data or the embedded files. The host WAV file is taken to be the final mix associated with the original multi-channel data before applying minimal MP3 compression (WAV to MP3 conversion), or, alternatively, an arbitrary host WAV file into which other multi-channel data in MP3 format is hidden. The embedded information can be encrypted and/or the embedding locations randomized on a channel-by-channel basis, depending on the security protocol desired by the user. The method is illustrated with example m-code, allowing interested readers to reproduce the results obtained to date and providing a basis for further development.
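A minimal sketch of the embedding step (in Python rather than the paper's m-code): each payload byte is split into two nibbles, each hidden in the four least significant bits of a 16-bit host sample. File names are placeholders, and bookkeeping such as storing the payload length, encryption, and randomized embedding locations is omitted:

```python
import numpy as np
from scipy.io import wavfile

# Sketch of 4-LSB embedding of an MP3 payload into a 16-bit PCM WAV host.
fs, host = wavfile.read("mix.wav")            # placeholder host file
if host.ndim > 1:
    host = host[:, 0]                         # keep the sketch mono
payload = np.fromfile("track1.mp3", dtype=np.uint8)  # placeholder payload

nibbles = np.empty(payload.size * 2, dtype=np.int16)
nibbles[0::2] = payload >> 4                  # high nibble of each byte
nibbles[1::2] = payload & 0x0F                # low nibble of each byte
assert nibbles.size <= host.size, "payload too large for host"

stego = host.copy()
stego[:nibbles.size] = (host[:nibbles.size] & ~0x000F) | nibbles
wavfile.write("stego.wav", fs, stego)

# Extraction reverses the process: read the low 4 bits back and repack
# (here assuming the payload length is known out of band).
recovered = stego[:nibbles.size] & 0x0F
payload_out = ((recovered[0::2] << 4) | recovered[1::2]).astype(np.uint8)
```

With 4 of 16 bits replaced per sample, the quantization noise floor rises by roughly 24 dB but remains far below the signal in typical full-scale mixes, which is why the distortion is perceptually negligible.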
Drumkit Transcription via Convolutive NMF
Audio-to-MIDI software exists for transcribing the output of a multi-mic'ed drum kit. Such software requires that the drummer use multiple microphones to capture a separate stream of audio for each kit piece. This paper explores the first steps towards a system for transcribing a drum score from the input of a single mono microphone. Non-negative Matrix Factorisation is a widely researched source separation technique. We describe a system for transcribing drums using this technique, presenting an improved gains update method. A good level of accuracy is achieved on complex loops, and there are indications that the mis-transcriptions concern perceptually less important parts of the score.
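As a sketch of the transcription step, the following runs convolutive NMF (NMFD-style multiplicative KL updates) with spectro-temporal drum templates held fixed; the paper's improved gains update is not reproduced here, and all data is random for illustration (in practice the templates would be learned from isolated hits):

```python
import numpy as np

# Convolutive NMF sketch: V ~ sum_t W[t] @ shift(H, t), updating only the
# activations H against fixed per-drum templates W.
rng = np.random.default_rng(0)
F, N, K, T = 128, 200, 3, 8          # bins, frames, drums, template length
V = np.abs(rng.normal(size=(F, N))) + 1e-9   # toy drum-loop spectrogram
W = np.abs(rng.normal(size=(T, F, K)))       # random stand-in templates
H = np.abs(rng.normal(size=(K, N)))          # activations (gains) to estimate

def shift(X, t):
    """Shift columns right by t (left for negative t), zero-padding."""
    Y = np.zeros_like(X)
    if t >= 0:
        Y[:, t:] = X[:, :X.shape[1] - t]
    else:
        Y[:, :t] = X[:, -t:]
    return Y

def approx(W, H):
    return sum(W[t] @ shift(H, t) for t in range(W.shape[0])) + 1e-9

for _ in range(100):                           # multiplicative KL updates for H
    R = V / approx(W, H)
    num = sum(W[t].T @ shift(R, -t) for t in range(T))
    den = sum(W[t].T @ np.ones((F, N)) for t in range(T)) + 1e-9
    H *= num / den

# Peak-pick each activation row to obtain candidate hit times per drum.
hits = [np.flatnonzero(h > h.max() * 0.5) for h in H]
```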
Effective Separation of Low-Pitch Notes Using NMF with Non-Power-of-2 Discrete Fourier Transforms
Non-negative matrix factorization (NMF), applied to decompose signals in the frequency domain by means of the short-time Fourier transform (STFT), is widely used in audio source separation. The separation of low-pitch notes in recordings is of significant interest. According to the time-frequency uncertainty principle, low-pitch sounds suffer from a tradeoff between time and frequency localization. Furthermore, because the window function applied to the signal causes frequency spreading, separation of low-pitch notes becomes even more difficult. Instead of using power-of-2 FFT sizes, we experiment with STFT sizes matched to the pitches of the notes in the signals. Computer simulations using synthetic signals show that the Source to Interferences Ratio (SIR) is significantly improved without sacrificing the Sources to Artifacts Ratio (SAR) or the Source to Distortion Ratio (SDR). On average, an improvement of at least 2 to 6 dB in SIR is achieved compared to power-of-2 FFTs of similar sizes.
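The core idea is easy to demonstrate: choose the STFT window as an integer number of fundamental periods of the target note, so its length is generally not a power of 2 and the note's partials fall on (or near) DFT bin centres. A small sketch with illustrative parameter choices:

```python
import numpy as np
from scipy.signal import stft

fs = 44100.0
f0 = 55.0                           # A1, a low-pitch note
periods = 16                        # window length in fundamental periods
nperseg = int(round(periods * fs / f0))   # 12829 samples: not a power of 2

t = np.arange(int(fs)) / fs
x = np.sin(2 * np.pi * f0 * t)      # toy test tone at the target pitch
f, tt, Z = stft(x, fs=fs, nperseg=nperseg, noverlap=nperseg // 2)
print(f"window = {nperseg} samples, bin spacing = {fs / nperseg:.3f} Hz")
```

Modern FFT libraries (including the one behind scipy) handle arbitrary transform sizes efficiently, so nothing forces the analysis onto power-of-2 grids that are mismatched to the pitch.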
Shifted NMF with Group Sparsity for Clustering NMF Basis Functions
Recently, Non-negative Matrix Factorisation (NMF) has found application in the separation of individual sound sources. NMF decomposes the spectrogram of an audio mixture into an additive parts-based representation, where the parts typically correspond to individual notes or chords. However, the NMF basis functions must then be clustered to their sources. Although many attempts have been made to improve this clustering, much research is still required in this area. Recently, Shifted Non-negative Matrix Factorisation (SNMF) was used to cluster these basis functions. We propose that incorporating group sparsity into Shifted NMF-based methods can benefit the clustering algorithms. We tested this on SNMF algorithms and obtained improved separation quality. Results show improved clustering of pitched basis functions over previous methods.
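A sketch of the group-sparsity ingredient in isolation: a standard KL-NMF update with an added L2,1 penalty that encourages basis functions in the same source group to activate together. The grouping and penalty weight are illustrative assumptions, and the Shifted NMF machinery itself is not reproduced:

```python
import numpy as np

# KL-NMF with an L2,1 group-sparsity penalty on the activations:
# minimize D_KL(V || WH) + lam * sum_g ||H_g||_F over nonnegative W, H.
rng = np.random.default_rng(0)
F, N, K = 129, 300, 8
V = np.abs(rng.normal(size=(F, N))) + 1e-9
W = np.abs(rng.normal(size=(F, K)))
H = np.abs(rng.normal(size=(K, N)))
groups = [slice(0, 4), slice(4, 8)]          # assumed basis-to-source grouping
lam = 0.1                                    # assumed penalty weight

for _ in range(200):
    Lam = W @ H + 1e-9
    grad_pen = np.zeros_like(H)              # d/dH of sum_g ||H_g||_F
    for g in groups:
        grad_pen[g] = H[g] / (np.linalg.norm(H[g]) + 1e-9)
    # multiplicative updates; the penalty enters the H denominator
    H *= (W.T @ (V / Lam)) / (W.T @ np.ones_like(V) + lam * grad_pen + 1e-9)
    W *= ((V / Lam) @ H.T) / (np.ones_like(V) @ H.T + 1e-9)
```

The penalty drives whole groups of activations toward zero together, so each time frame tends to be explained by basis functions from a single source, which is what aids the clustering.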
Real-time Finite Difference Physical Models of Musical Instruments on a Field Programmable Gate Array (FPGA)
Real-time sound synthesis of musical instruments based on solving differential equations is of great interest in Musical Acoustics, especially in terms of linking the geometric features of musical instruments to sound features. A major restriction of accurate physical models is the computational effort: the calculation cost is directly linked to the geometrical and material accuracy of a physical model, and hence to the validity of the results. This work presents a methodology for implementing real-time models of whole instrument geometries, modelled with the Finite Differences Method (FDM), on a Field Programmable Gate Array (FPGA), a device capable of massively parallel computation. Three real-time musical instrument implementations are given as examples: a banjo, a violin and a Chinese ruan.
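As a sketch of the kind of update an FPGA parallelizes per grid point, here is the classic explicit finite-difference scheme for the 1-D wave equation (an ideal string), run serially in numpy with illustrative parameters:

```python
import numpy as np

# Explicit leapfrog scheme for the 1-D wave equation u_tt = c^2 u_xx.
# An FPGA evaluates the per-point update for all grid points in parallel,
# one full spatial sweep per output sample.
fs = 44100.0                 # audio rate
L, c = 0.65, 200.0           # string length (m), wave speed (m/s)
dt = 1.0 / fs
dx = c * dt                  # Courant number = 1: the stability limit
M = int(L / dx)              # number of spatial grid points
lam2 = (c * dt / dx) ** 2

u_prev = np.zeros(M)
u = np.zeros(M)
u[M // 4] = 0.005            # crude pluck: displace one grid point

out = np.empty(2000)
for n in range(out.size):
    u_next = np.zeros(M)
    # u[i]^{n+1} = 2u[i]^n - u[i]^{n-1} + lam2*(u[i+1]^n - 2u[i]^n + u[i-1]^n)
    u_next[1:-1] = (2 * u[1:-1] - u_prev[1:-1]
                    + lam2 * (u[2:] - 2 * u[1:-1] + u[:-2]))
    u_prev, u = u, u_next    # fixed boundaries: u[0] = u[-1] = 0
    out[n] = u[M // 8]       # read a "pickup" position as the output sample
```

Whole-instrument geometries replace this single line of grid points with 2-D and 3-D grids for plates, air volumes and strings coupled together, which is exactly where the per-point parallelism of an FPGA pays off.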
Characterisation of Acoustic Scenes Using a Temporally-constrained Shift-invariant Model
In this paper, we propose a method for modeling and classifying acoustic scenes using temporally-constrained shift-invariant probabilistic latent component analysis (SIPLCA). SIPLCA can extract time-frequency patches from spectrograms in an unsupervised manner. Component-wise hidden Markov models are incorporated into the SIPLCA formulation to enforce temporal constraints on the activation of each acoustic component. The time-frequency patches are converted to cepstral coefficients to provide a compact representation of acoustic events within a scene. Experiments are performed on a corpus of train station recordings classified into six scene classes. Results show that the proposed model captures salient events within a scene and outperforms a non-negative matrix factorization algorithm on the same task. In addition, the use of temporal constraints is demonstrated to improve performance.
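A sketch of the final representation step only: converting a learned time-frequency patch to a compact cepstral vector via log compression and a DCT along frequency. The patch here is random; in the model it would come from the SIPLCA decomposition:

```python
import numpy as np
from scipy.fft import dct

rng = np.random.default_rng(0)
patch = rng.random((40, 10)) + 1e-9    # 40 freq bins x 10 frames (toy patch)

def patch_to_cepstra(patch, n_coeffs=13):
    """Log-compress each frame, DCT along frequency, keep low quefrencies."""
    log_spec = np.log(patch)
    cepstra = dct(log_spec, type=2, norm="ortho", axis=0)
    return cepstra[:n_coeffs]           # (n_coeffs, n_frames) compact summary

features = patch_to_cepstra(patch).mean(axis=1)   # one vector per component
```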