Combining classifications based on local and global features: application to singer identification
In this paper we investigate the problem of singer identification on a cappella recordings of isolated notes. Most studies on singer identification describe the singing-voice signal with features related to timbre (such as MFCC or LPC). These features describe the behavior of frequencies at a given instant of time (local features). In this paper, we propose to describe a sung tone by the temporal variations of the fundamental frequency (and its harmonics) of the note. The periodic and continuous variations of the frequency trajectories are analyzed over the whole note, and the resulting features reflect expressive and intonative elements of singing such as vibrato, tremolo and portamento. Experiments conducted on two distinct data sets (lyric and pop-rock singers) show that the new set of features captures part of the singer's identity. However, these features are less accurate than timbre-based features. We therefore propose to increase the recognition rate of singer identification by combining the information conveyed by the local and global descriptions of notes. The proposed method, which shows good results, can be adapted to classification problems involving a large number of classes, or to combine classifiers with different levels of performance.
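A minimal sketch of the two ingredients described above, assuming a Python/NumPy setting: a global descriptor extracted from the f0 trajectory of one note (vibrato rate and extent) and a weighted-sum fusion of the class posteriors returned by a local (timbre) and a global (intonation) classifier. The function names, the fixed 3-10 Hz vibrato band and the fusion weight are illustrative assumptions, not the authors' exact feature set or combination rule.

```python
import numpy as np

def vibrato_descriptors(f0_hz, hop_s=0.01):
    """Estimate vibrato rate (Hz) and extent (cents) from the f0 trajectory
    of one sustained note (a global, note-level description).
    Assumes the note is long enough to resolve the 3-10 Hz band."""
    cents = 1200.0 * np.log2(f0_hz / np.mean(f0_hz))        # deviation in cents
    n = np.arange(len(cents))
    cents -= np.polyval(np.polyfit(n, cents, 1), n)         # remove slow drift (portamento)
    win = np.hanning(len(cents))
    spec = np.abs(np.fft.rfft(cents * win))
    freqs = np.fft.rfftfreq(len(cents), d=hop_s)
    band = (freqs >= 3.0) & (freqs <= 10.0)                 # typical vibrato range
    k = np.argmax(spec[band])
    rate = freqs[band][k]
    extent = 2.0 * spec[band][k] / np.sum(win)              # sinusoid amplitude estimate
    return rate, extent

def combine_posteriors(p_local, p_global, alpha=0.7):
    """Late fusion of the class posteriors of the timbre-based (local) and
    intonation-based (global) classifiers by a weighted sum."""
    p = alpha * np.asarray(p_local) + (1.0 - alpha) * np.asarray(p_global)
    return p / p.sum()
```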
Enhanced Beat Tracking with Context-Aware Neural Networks
We present two new beat tracking algorithms based on autocorrelation analysis, which showed state-of-the-art performance in the MIREX 2010 beat tracking contest. Unlike the traditional approach of processing a list of onsets, we use a bidirectional Long Short-Term Memory recurrent neural network to perform a frame-by-frame beat classification of the signal. The spectral features of the audio signal and their relative differences are used as inputs to the network, which transforms the signal directly into a beat activation function. An autocorrelation function is then used to determine the predominant tempo, so that erroneously detected beats can be eliminated and missing beats complemented. The first algorithm is tuned for music with constant tempo, whereas the second is also capable of following changes in tempo and time signature.
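As a rough illustration of the second stage, the sketch below (Python/NumPy, not the authors' code) picks the predominant tempo from a frame-wise beat activation function by locating the strongest autocorrelation peak inside an admissible BPM range; the frame rate and BPM limits are assumed values.

```python
import numpy as np

def dominant_tempo(beat_activation, fps=100, bpm_range=(40, 220)):
    """Predominant tempo (BPM) from a frame-wise beat activation function,
    sampled at `fps` frames per second, via autocorrelation."""
    act = np.asarray(beat_activation, dtype=float)
    act -= act.mean()
    acf = np.correlate(act, act, mode='full')[len(act) - 1:]   # lags 0..len-1
    # convert the admissible BPM range to a lag range in frames
    min_lag = int(round(60.0 * fps / bpm_range[1]))
    max_lag = int(round(60.0 * fps / bpm_range[0]))
    lag = min_lag + int(np.argmax(acf[min_lag:max_lag + 1]))
    return 60.0 * fps / lag
```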
On the Use of Perceptual Properties for Melody Estimation
This paper addresses the use of perceptual principles for melody estimation. The melody stream is understood as the one generated by the most dominant source. Since the source with the strongest energy may not be perceptually the most dominant one, we study three perceptual properties for melody estimation: loudness, masking effects and timbre similarity. The related criteria are integrated into a melody estimation system and their respective contributions are evaluated. The effectiveness of these perceptual criteria is confirmed by evaluation results on more than one hundred excerpts of music recordings.
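The loudness criterion can be illustrated with a crude stand-in: weighting the harmonic amplitudes of a pitch candidate by an A-weighting curve before summing them into a salience value, so that a perceptually dominant source can win over a merely energetic one. This Python sketch is only an assumption about how such a criterion might look; it is not the loudness, masking or timbre-similarity criteria actually used in the paper.

```python
import numpy as np

def a_weighting_db(f_hz):
    """Standard A-weighting curve in dB, used here as a simple proxy for
    frequency-dependent loudness sensitivity."""
    f2 = np.asarray(f_hz, dtype=float) ** 2
    num = (12194.0 ** 2) * f2 ** 2
    den = ((f2 + 20.6 ** 2)
           * np.sqrt((f2 + 107.7 ** 2) * (f2 + 737.9 ** 2))
           * (f2 + 12194.0 ** 2))
    return 20.0 * np.log10(num / den) + 2.0

def loudness_weighted_salience(f0_hz, partial_amps, n_harmonics=10):
    """Salience of an f0 candidate: harmonic amplitudes (at least
    n_harmonics of them) weighted by the A-curve and summed."""
    harmonics = f0_hz * np.arange(1, n_harmonics + 1)
    gains = 10.0 ** (a_weighting_db(harmonics) / 20.0)
    return float(np.sum(gains * np.asarray(partial_amps[:n_harmonics])))
```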
Black Box methodology for the characterization of Sample Rate Conversion systems
Digital systems dedicated to audio and speech processing usually require sample rate conversion units in order to adapt the sample rate between different signal flows: for instance 8 and 16 kHz for speech, 32 kHz for broadcast, 44.1 kHz for CDs and 48 kHz for studio work. The designer chooses the sample rate conversion (SRC) technology based on objective criteria, such as complexity figures, development or integration cycle and, of course, performance characterization. For linear time-invariant (LTI) systems, the transfer function contains most of the information necessary for characterization. Since an SRC system is not LTI, however, its characterization also requires a description of aliasing. When the system under study is available only through input excitations and output observations (i.e. in black-box conditions), aliasing characterization obtained, for instance, through distortion measurements is difficult to evaluate properly. Furthermore, aliasing measurements can be confounded with weakly nonlinear artifacts, such as those due to internal rounding errors. We consider the fractional SRC system as a linear periodically time-varying (LPTV) system whose characteristics describe simultaneously the aliasing and the in-band (so-called linear) behaviour of the SRC. A new compound system built from multiple instances of the same SRC system forms an LTI system. The linear features of this compound system fully characterize the SRC (i.e. its linear behaviour and its aliasing rejection), whereas the weakly nonlinear features obtained from distortion measurements are due only to internal rounding errors. The SRC system can thus be analyzed in black-box conditions, either in batch or in real-time processing. Examples illustrate the capability of the method to fully recover the characteristics of a multistage SRC system and to separate quantization effects from rounding noise in actual SRC implementations.
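For contrast with the proposed compound-system approach, the sketch below (Python/NumPy) illustrates the classical black-box probing the abstract refers to: driving an SRC with a single tone and lumping everything outside the expected linear component into one residual. As the paper points out, this residual mixes aliasing, distortion and rounding noise, which is precisely what the LPTV/compound-system characterization avoids. The callable `src`, the window and the bin tolerance are assumptions for illustration.

```python
import numpy as np

def probe_src(src, f_probe, fs_in, fs_out, duration=2.0):
    """Drive a black-box sample-rate converter `src` (a callable taking a
    NumPy array at fs_in and returning one at fs_out) with a pure tone and
    measure the expected (linear) component against everything else
    (aliasing + distortion + rounding noise lumped together)."""
    n_in = int(duration * fs_in)
    t = np.arange(n_in) / fs_in
    x = np.sin(2 * np.pi * f_probe * t)
    y = np.asarray(src(x), dtype=float)
    win = np.hanning(len(y))
    spec = np.abs(np.fft.rfft(y * win)) ** 2
    freqs = np.fft.rfftfreq(len(y), d=1.0 / fs_out)
    k = int(np.argmin(np.abs(freqs - f_probe)))      # bin of the linear component
    signal = spec[max(k - 2, 0):k + 3].sum()         # small window around the tone
    residual = spec.sum() - signal
    return 10.0 * np.log10(signal / residual)        # signal-to-residual ratio in dB
```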
Optimal Filter Partitions for Real-Time FIR Filtering using Uniformly-Partitioned FFT-based Convolution in the Frequency-Domain
This paper concerns highly efficient real-time FIR filtering with low input-to-output latencies. For this type of application, partitioned frequency-domain convolution algorithms are established methods, combining efficiency with the required low latencies. Frequency-domain convolution realizes linear FIR filtering by means of circular convolution. Therefore, the frequency transform's period must be allocated between input samples and filter coefficients, which affects the filter partitioning. A common choice, as found in many publications, is a transform size K=2B of two times the audio streaming block length B. In this publication we review this choice based on a generalized FFT-based fast convolution algorithm with uniform filter partitioning. The correspondence between FFT sizes, filter partitions and the resulting computational costs is examined. We present an optimization technique to determine the best FFT size. The resulting costs for stream filtering and filter transformations are discussed in detail. It is shown that for real-time FIR filtering it is always beneficial to partition filters. Our results provide evidence that K=2B is a good choice, but they also show that an optimal FFT size can achieve a significant speedup for long filters and low latencies. Keywords: Real-time filtering, Fast convolution, Partitioned convolution, Optimal filter partitioning
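A back-of-the-envelope version of the cost trade-off, assuming overlap-save uniformly partitioned convolution: per block of B samples, one forward and one inverse FFT of size K plus one spectral multiply-accumulate per filter part. The cost constants and the restriction to power-of-two K at or above 2B are illustrative assumptions, not the paper's detailed cost model.

```python
import math

def per_sample_cost(N, B, K, c_fft=2.0):
    """Rough per-sample cost of uniformly partitioned overlap-save convolution
    for a length-N filter, block length B and FFT size K."""
    L = K - B + 1                        # usable filter-part length per FFT
    if L < 1:
        return math.inf
    P = math.ceil(N / L)                 # number of uniform filter parts
    fft_cost = 2 * c_fft * K * math.log2(K)          # forward + inverse FFT
    fdl_cost = P * 4 * (K // 2 + 1)                  # spectral MACs in the delay line
    return (fft_cost + fdl_cost) / B

def best_fft_size(N, B, max_pow=20):
    """Search power-of-two FFT sizes from the conventional K=2B upward."""
    start = int(math.ceil(math.log2(2 * B)))
    candidates = [2 ** p for p in range(start, max_pow)]
    return min(candidates, key=lambda K: per_sample_cost(N, B, K))
```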
A Simple Digital Model of the Diode-Based Ring-Modulator
The analog diode-based ring modulator has a distinctive sound quality compared to standard digital ring modulation, due to the non-linear behaviour of the diodes. It would be desirable to recreate this sound in a digital context for musical uses. However, the topology of the standard circuit for a diode-based ring modulator can make the modelling process complex and potentially computationally heavy. In this work, we examine the behaviour of the standard diode ring modulator circuit and propose a number of simplifications that preserve the important behaviour while being easier to analyse. From these simplified circuits, we derive a simple and efficient digital model of the diode-based ring modulator based on a small network of static non-linearities. We propose a model for the non-linearities, along with parameterisations that allow the sound and behaviour to be modified dynamically for musical uses.
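A hedged Python sketch of the general idea of a small network of static non-linearities: four copies of a diode-shaped curve arranged so that the input signal is passed with positive sign while the carrier is positive and with negative sign while it is negative. The piecewise curve, its break points and the exact network topology are illustrative assumptions rather than the model derived in the paper.

```python
import numpy as np

def diode(v, vb=0.2, vl=0.4):
    """Static non-linearity shaped like a diode's conduction curve:
    dead zone below vb, quadratic knee between vb and vl, linear above vl.
    The break points are illustrative, not tuned values from the paper."""
    if v <= vb:
        return 0.0
    if v <= vl:
        return (v - vb) ** 2 / (2.0 * (vl - vb))
    return v - vl + (vl - vb) / 2.0

def ring_mod(carrier, signal):
    """One sample of a diode-ring-style modulator built from four copies of
    the static non-linearity; for a large positive carrier the output tends
    to +signal, for a large negative carrier to -signal."""
    return (diode(carrier + 0.5 * signal) - diode(carrier - 0.5 * signal)
            - diode(-carrier + 0.5 * signal) + diode(-carrier - 0.5 * signal))

# Example: modulate a 120 Hz tone with a 500 Hz carrier.
fs = 44100
t = np.arange(fs) / fs
carrier = 0.8 * np.sin(2 * np.pi * 500 * t)
signal = 0.5 * np.sin(2 * np.pi * 120 * t)
out = np.array([ring_mod(c, s) for c, s in zip(carrier, signal)])
```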
Gestural Auditory and Visual Interactive Platform
This paper introduces GAVIP, an interactive and immersive platform that allows audio-visual virtual objects to be controlled in real time by physical gestures with a high degree of intermodal coherency. The focus is placed on two scenarios exploring the interaction between a user and the audio, visual and spatial synthesis of a virtual world. The platform can be seen as an extended virtual musical instrument that allows interaction with three modalities: audio, visual and spatial. Intermodal coherency is thus of particular importance in this context. The possibilities and limitations offered by the two developed scenarios are discussed and future work is presented.
Time-Variant Delay Effects based on Recurrence Plots
Recurrence plots (RPs) are two-dimensional binary matrices that represent patterns of recurrence in time series data and are typically used to analyze the behavior of non-linear dynamical systems. In this paper, we propose a method for generating time-variant delay effects in which the recurrences in an RP are used to restructure an audio buffer. We describe offline and real-time systems based on this method, as well as a real-time implementation for the Max/MSP environment in which the user creates an RP graphically. In addition, we discuss the use of gestural data to generate an RP, suggesting a potential extension to the system. The graphical and gestural interfaces provide an intuitive and convenient way to control a time-varying delay.
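A minimal Python/NumPy sketch of the core idea, assuming frame-wise features have been extracted from the audio: build a binary RP by thresholding pairwise feature distances and, wherever a frame recurs, read its earlier occurrence back as a delay tap. The threshold, frame handling and mixing rule are illustrative, not the offline/real-time systems described in the paper.

```python
import numpy as np

def recurrence_plot(features, eps):
    """Binary recurrence matrix: R[i, j] = 1 when feature frames i and j
    lie within Euclidean distance eps of each other."""
    d = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    return (d < eps).astype(np.uint8)

def rp_delay(audio, features, frame_len, eps=0.1, mix=0.5):
    """For each frame, mix the current audio with the earliest recurring
    frame indicated by the RP (reading recurrences as delay taps)."""
    n_frames = min(len(features), len(audio) // frame_len)
    R = recurrence_plot(np.asarray(features)[:n_frames], eps)
    out = np.array(audio, dtype=float, copy=True)
    for i in range(n_frames):
        earlier = np.flatnonzero(R[i, :i])        # past frames similar to frame i
        if earlier.size:
            j = int(earlier[0])
            cur = slice(i * frame_len, (i + 1) * frame_len)
            past = slice(j * frame_len, (j + 1) * frame_len)
            out[cur] = (1.0 - mix) * out[cur] + mix * np.asarray(audio)[past]
    return out
```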
A Sound Localization based Interface for Real-Time Control of Audio Processing
This paper describes the implementation of an innovative musical interface based on the sound localization capability of a microphone array. Our proposal allows a musician to plan and conduct the expressivity of a performance by controlling an audio processing module in real time through the spatial movement of a sound source, e.g. voice, traditional musical instruments, or sounding mobile devices. The proposed interface locates and tracks the sound in a two-dimensional space with accuracy, so that the x-y coordinates of the sound source can be used to control the processing parameters. In particular, the paper focuses on the localization and tracking of harmonic sound sources in real, moderately reverberant and noisy environments. To this purpose, we designed a system based on adaptive parameterized Generalized Cross-Correlation (GCC) with Phase Transform (PHAT) weighting and a Zero-Crossing Rate (ZCR) threshold, a Wiener filter to improve the Signal-to-Noise Ratio (SNR), and a Kalman filter to make the position estimation more robust and accurate. We developed Max/MSP external objects to test the system in a real scenario and to validate its usability.
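The GCC-PHAT building block mentioned above can be sketched as follows (Python/NumPy): the cross-spectrum of two microphone signals is whitened so that only phase information remains, and the peak of the resulting cross-correlation gives the time difference of arrival. The adaptive parameterization, ZCR gating, Wiener filtering and Kalman tracking of the paper are not included.

```python
import numpy as np

def gcc_phat(x, y, fs, max_tau=None):
    """Time difference of arrival (seconds) between two microphone signals
    using GCC with PHAT weighting."""
    n = len(x) + len(y)
    X = np.fft.rfft(x, n)
    Y = np.fft.rfft(y, n)
    cross = X * np.conj(Y)
    cross /= np.abs(cross) + 1e-12           # PHAT: keep phase, discard magnitude
    cc = np.fft.irfft(cross, n)
    max_shift = n // 2
    if max_tau is not None:                  # limit search to physically possible delays
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(np.abs(cc))) - max_shift
    return shift / float(fs)
```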
Similarity-based Sound Source Localization with a Coincident Microphone Array
This paper presents a robust, accurate sound source localization method using a compact, near-coincident microphone array. We derive features by combining the microphone signals and determine the direction of a single sound source by similarity matching: the observed features are compared with a set of previously measured reference features stored in a look-up table. By proper processing in the similarity domain, we are able to deal with signal pauses and low SNR without the need for a separate detection algorithm. For practical evaluation, we made recordings of speech signals (both loudspeaker playback and a human speaker) with a planar 4-channel prototype array in a medium-sized room. The proposed approach clearly outperforms existing coincident localization methods. We achieve high accuracy (2° mean absolute azimuth error at 0 dB SNR) for static sources, while being able to quickly follow rapid changes in source angle.
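A minimal sketch of the similarity-matching step, assuming the observed and reference features are plain vectors and using cosine similarity; the actual features and similarity measure of the paper are not specified here, so both are assumptions.

```python
import numpy as np

def localize_by_similarity(observed, table_features, table_angles):
    """Return the azimuth (and its similarity score) whose stored reference
    feature vector is most similar to the observed feature vector.
    `table_features` has one row per look-up-table entry."""
    obs = np.asarray(observed, dtype=float)
    obs /= np.linalg.norm(obs) + 1e-12
    refs = np.asarray(table_features, dtype=float)
    refs /= np.linalg.norm(refs, axis=1, keepdims=True) + 1e-12
    sims = refs @ obs                          # cosine similarity per table entry
    best = int(np.argmax(sims))
    return table_angles[best], float(sims[best])
```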