System analysis and performance tuning for broadcast audio fingerprinting
An audio fingerprint is a compact, content-based signature that summarizes an audio recording. Audio fingerprinting technologies have recently attracted attention because they allow audio to be monitored independently of its format and without the need for meta-data or watermark embedding. To succeed in real audio broadcasting environments, these technologies must address channel robustness as well as system accuracy and scalability. This paper presents a complete audio fingerprinting system for broadcast audio monitoring that satisfies the above requirements. The system's performance is enhanced through four proposals that required detailed analysis of the system blocks as well as extensive system tuning experiments.
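As a concrete illustration of a compact, content-based signature, the following Python sketch computes a toy fingerprint from coarse spectral-band energy trends and compares two fingerprints by bit agreement. This is a generic illustration only, not the system described in the paper; the band range, bit rule, and scoring are arbitrary choices made for the example.

```python
import numpy as np

def fingerprint(signal, sr, frame_len=2048, hop=1024, n_bands=16):
    """Toy fingerprint: one bit per band per frame, set when the band's
    energy rises relative to the previous frame. A generic illustration,
    not the paper's system."""
    window = np.hanning(frame_len)
    edges = np.logspace(np.log10(300.0), np.log10(2000.0), n_bands + 1)
    freqs = np.fft.rfftfreq(frame_len, 1.0 / sr)
    prev, bits = None, []
    for start in range(0, len(signal) - frame_len, hop):
        spec = np.abs(np.fft.rfft(window * signal[start:start + frame_len]))
        energies = np.array([spec[(freqs >= lo) & (freqs < hi)].sum()
                             for lo, hi in zip(edges[:-1], edges[1:])])
        if prev is not None:
            bits.append((energies > prev).astype(np.uint8))
        prev = energies
    return np.array(bits)

def match_score(fp_a, fp_b):
    """Fraction of agreeing bits over the overlapping frames; robustness to
    channel distortion comes from comparing only coarse energy trends."""
    n = min(len(fp_a), len(fp_b))
    return float(np.mean(fp_a[:n] == fp_b[:n]))
```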
Frequency-domain techniques for high-quality voice modification
This paper presents new frequency-domain voice modification techniques that combine the high quality usually obtained by time-domain techniques such as TD-PSOLA with the flexibility provided by the frequency-domain representation. The technique only works for monophonic (single-speaker) sources and relies on a (possibly online) pitch detection. Based on the detected pitch, and according to the desired pitch and formant modifications, individual harmonics are selected and shifted to new locations in the spectrum. The harmonic phases are updated according to a pitch-based method that aims to achieve time-domain shape invariance, thereby reducing or eliminating the artifacts usually associated with frequency-domain and sinusoidal voice modification techniques. The result is a fairly inexpensive, flexible algorithm that matches the quality of time-domain techniques while providing a vastly wider array of available modifications.
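A minimal sketch of the harmonic-relocation step described above, assuming the pitch f0 is already known for the frame: the bin nearest each harmonic is moved to the bin of the scaled harmonic. Real systems spread each harmonic over several bins and, crucially, apply the paper's shape-invariant phase update across frames, both of which this toy version omits.

```python
import numpy as np

def shift_harmonics(frame, sr, f0, pitch_factor):
    """Crude single-frame sketch: relocate the bin nearest each harmonic
    of f0 to the bin of the scaled harmonic. Omits the cross-frame phase
    update that the paper relies on for quality."""
    n = len(frame)
    spec = np.fft.rfft(frame * np.hanning(n))
    out = np.zeros_like(spec)
    bin_hz = sr / n
    k = 1
    while k * f0 * max(1.0, pitch_factor) < sr / 2:
        src = int(round(k * f0 / bin_hz))                 # source harmonic bin
        dst = int(round(k * f0 * pitch_factor / bin_hz))  # target harmonic bin
        out[dst] += spec[src]
        k += 1
    return np.fft.irfft(out, n)
```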
Content-based melodic transformations of audio material for a music processing application
This paper presents an application for performing melodic transformations on monophonic audio phrases. The system first extracts a melodic description from the audio. This description is presented to the user and can be stored and loaded in an MPEG-7-based format. A set of high-level transformations can then be applied to the melodic description. These high-level transformations are mapped into a set of low-level signal transformations and then applied to the audio signal. The algorithms for description extraction and audio transformation are also presented.
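The mapping from high-level melodic transformations to low-level signal transformations can be sketched as follows; the Note description and the per-segment pitch-shift interface are hypothetical stand-ins for the paper's MPEG-7-based description and transformation algorithms.

```python
from dataclasses import dataclass

@dataclass
class Note:
    pitch_midi: float   # extracted pitch as a MIDI note number
    onset_s: float
    duration_s: float

def transpose(melody, semitones):
    """High-level transformation applied to the melodic description."""
    return [Note(n.pitch_midi + semitones, n.onset_s, n.duration_s)
            for n in melody]

def to_lowlevel(original, transformed):
    """Map the edited description to low-level signal transformations:
    one (start, end, pitch-shift ratio) triple per note segment
    (hypothetical interface)."""
    return [(o.onset_s, o.onset_s + o.duration_s,
             2.0 ** ((t.pitch_midi - o.pitch_midi) / 12.0))
            for o, t in zip(original, transformed)]
```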
An efficient audio time-scale modification algorithm for use in a subband implementation
The PAOLA algorithm is an efficient algorithm for the time-scale modification of speech. It uses a simple peak-alignment technique to synchronise synthesis frames, and takes waveform properties and the desired time-scale factor into account to determine optimum algorithm parameters. However, PAOLA has difficulties with certain waveform types and can result in poor synchronisation in subband implementations. SOLA is a less efficient algorithm but resolves the issues associated with PAOLA's implementation. We present a combination of the two approaches that proves to be an efficient and effective algorithm for a subband implementation.
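For reference, here is a minimal sketch of the generic SOLA idea that the paper builds on: each new analysis frame is shifted within a small search range so that it best correlates with the current output tail, then cross-faded in. It is not the paper's combined PAOLA/SOLA algorithm or its subband variant, and the parameters are illustrative only.

```python
import numpy as np

def sola(x, alpha, frame=1024, overlap=256, search=128):
    """Minimal SOLA time-scale modification of a 1-D array x
    (alpha > 1 slows playback down). Sketch of the generic technique."""
    hop_a = frame - overlap                  # analysis hop
    hop_s = int(round(hop_a * alpha))        # synthesis hop
    y = list(x[:frame].astype(float))
    pos_s = hop_s
    fade = np.linspace(0.0, 1.0, overlap)
    for pos_a in range(hop_a, len(x) - frame, hop_a):
        seg = x[pos_a:pos_a + frame].astype(float)
        # find the shift that best aligns the new frame with the output tail
        best_shift, best_corr = 0, -np.inf
        for shift in range(-search, search + 1):
            start = pos_s + shift
            if start < 0 or start + overlap > len(y):
                continue
            corr = np.dot(np.array(y[start:start + overlap]), seg[:overlap])
            if corr > best_corr:
                best_corr, best_shift = corr, shift
        if best_corr == -np.inf:             # no valid shift: append flush
            start = len(y) - overlap
        else:
            start = pos_s + best_shift
        merged = (np.array(y[start:start + overlap]) * (1 - fade)
                  + seg[:overlap] * fade)    # cross-fade the overlap region
        y = y[:start] + list(merged) + list(seg[overlap:])
        pos_s = start + hop_s
    return np.array(y)
```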
A new approach to transient processing in the phase vocoder
In this paper we propose a new method to reduce phase vocoder artifacts during attack transients. In contrast to all transient preservation algorithms proposed up to now, the new approach does not impose any constraints on the time dilation parameter when processing transient segments. Through an investigation of the spectral properties of attack transients of simple sinusoids, we provide new insights into the causes of phase vocoder artifacts and propose a new method for transient preservation as well as a new criterion and a new algorithm for transient detection. Both the transient detection and the transient processing algorithms operate on the level of spectral bins, which reduces possible artifacts in stationary signal components close to the spectral peaks classified as transient. The transient detection criterion is closely related to the transient position and allows us to find an optimal position for reinitializing the phase spectrum. Evaluation of the transient detector on a hand-labeled database demonstrates its superior performance compared to a previously published algorithm. Attack transients in sound signals transformed with the new algorithm achieve high quality even when strong dilation is applied to polyphonic signals.
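The following sketch shows the general shape of a per-bin phase-vocoder synthesis step with a transient rule: stationary bins get the usual phase propagation with scaled instantaneous frequency, while bins flagged as transient have their phase reinitialized to the analysis phase. The simple magnitude-jump detector used here is only a stand-in; the paper's detection criterion and its choice of optimal reinitialization position are more sophisticated.

```python
import numpy as np

def pv_phase_step(prev_mag, mag, prev_syn_phase, prev_ana_phase, ana_phase,
                  hop_a, hop_s, n_fft, jump=4.0):
    """One per-bin phase-vocoder synthesis step with a toy transient rule.
    All magnitude/phase arrays hold one value per FFT bin."""
    bins = np.arange(len(mag))
    omega = 2 * np.pi * bins * hop_a / n_fft          # expected phase advance
    dphi = ana_phase - prev_ana_phase - omega
    dphi = np.mod(dphi + np.pi, 2 * np.pi) - np.pi    # wrap to (-pi, pi]
    inst_freq = (omega + dphi) / hop_a                # instantaneous frequency
    syn_phase = prev_syn_phase + inst_freq * hop_s    # stationary bins
    transient = mag > jump * (prev_mag + 1e-12)       # toy per-bin detector
    syn_phase[transient] = ana_phase[transient]       # phase reset at onsets
    return syn_phase, transient
```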
Description-driven context-sensitive effects
We introduce a new paradigm in digital audio effects that is based on more symbolic manipulations of elements of a sound, rather than on linear signal processing alone. By utilising content descriptions such as those enabled by MPEG-7, a system may apply context-sensitive effects that are more aware of the structure of the sound than current systems. We advocate a standards-based approach (with MPEG-4, -7, and -21) so as to maximise interoperability between different systems. The paper outlines MPEG-7 description structures that may be used as the basis for controlling and triggering effects in a system, and explores the different possibilities opened up by this paradigm. The way is then pointed towards more sophisticated control structures that may lead to more "musical" and dynamic effects.
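A toy example of description-driven, context-sensitive processing: segment descriptions, here plain (start, end, label) tuples standing in for MPEG-7 AudioSegment structures, trigger different effects on different parts of the sound. The labels, rules, and effects below are all hypothetical.

```python
import numpy as np

def soft_compress(seg):
    """Toy dynamic-range compression."""
    return np.tanh(2.0 * seg) / np.tanh(2.0)

def short_echo(seg, sr=44100, delay_s=0.05, gain=0.4):
    """Toy echo standing in for a reverb."""
    out = seg.copy()
    d = int(delay_s * sr)
    if len(out) > d:
        out[d:] += gain * seg[:-d]
    return out

# Hypothetical label -> effect rules; a real system would read MPEG-7
# descriptions rather than plain tuples.
RULES = {"speech": soft_compress, "music": short_echo}

def render(audio, sr, segments):
    """Apply a context-sensitive effect to each described segment."""
    out = audio.astype(float).copy()
    for start_s, end_s, label in segments:
        a, b = int(start_s * sr), int(end_s * sr)
        effect = RULES.get(label)
        if effect is not None:
            out[a:b] = effect(out[a:b])
    return out
```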
Multimodal Interfaces for Expressive Sound Control
This paper introduces research issues on multimodal interaction and interfaces for expressive sound control. We introduce Multisensory Integrated Expressive Environments (MIEEs) as a framework for mixed-reality applications in the performing arts. Paradigmatic contexts for applications of MIEEs are multimedia concerts; interactive dance, music, and video installations; interactive museum exhibitions; and distributed cooperative environments for theatre and artistic expression. MIEEs are user-centred systems able to interpret the high-level information conveyed by performers through their expressive gestures, and to establish an effective multisensory experience that takes expressive, emotional, and affective content into account. The lecture discusses the main issues for MIEEs and presents the EyesWeb (www.eyesweb.org) open software platform, which has recently been redesigned (version 4) to better address MIEE requirements. Short live demonstrations are also presented.
The Sounding Gesture: An Overview
Sound control by gesture is a peculiar topic in human-computer interaction: many different approaches to it are available, each focusing on a different perspective. Our point of view is interdisciplinary: taking into account technical considerations about control theory and sound processing, we try to explore the world of expressiveness that is closer to psychological theories. Starting from a state of the art that outlines two main approaches to the problem of "making sound with gestures", we delve into psychological theories of expressiveness, describing in particular possible applications dealing with intermodality and mixed-reality environments related to Gestalt theory. HCI design can indeed benefit from this kind of approach because of the quantitative methods that can be applied to measure expressiveness. Interfaces can be used to convey expressiveness, which is additional information that can help in interacting with the machine; this kind of information can be coded as spatio-temporal schemes, as stated in Gestalt theory.
Effect of Latency on Playing Accuracy of Two Gesture-Controlled Continuous Sound Instruments Without Tactile Feedback
The paper reports results from an experimental study quantifying how latency affects the playing accuracy of two continuous sound instruments. Eleven subjects played a conventional Theremin and a virtual-reality Theremin; both instruments provided the user only audio feedback. The subjects performed two tasks under different instrument latencies: they attempted to match the pitch of the instrument to a sample pitch, and they played along with a short sample melody and a metronome. Both the sample sound and the instrument's sound were recorded on separate channels of a sound file; the pitch of each was later extracted and user performance analyzed. The results show that the time required to match a given pitch increases by about five times the introduced latency, suggesting that the feedback latency accumulates over the whole task. Errors while playing along with a sample melody increased by 80% on average at the highest latency of 240 ms, while latencies up to 120 ms increased the errors only slightly.
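One plausible form of the per-task analysis (an assumption for illustration, not the paper's published procedure): from the frame-level pitch track extracted from the instrument channel, estimate how long the player took to settle within a tolerance of the target pitch and stay there.

```python
import numpy as np

def time_to_match(target_hz, played_hz, frame_s, tol_cents=50, hold_frames=10):
    """Estimate settling time: first moment the played pitch stays within
    tol_cents of the target for hold_frames consecutive frames.
    played_hz is a frame-level pitch track; frame_s is seconds per frame."""
    cents = 1200 * np.log2(np.maximum(played_hz, 1e-6) / target_hz)
    ok = np.abs(cents) <= tol_cents
    run = 0
    for i, flag in enumerate(ok):
        run = run + 1 if flag else 0
        if run == hold_frames:
            return (i - hold_frames + 1) * frame_s
    return None  # the player never settled within tolerance
```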
Audio-Based Gesture Extraction on the ESITAR Controller
Using sensors to extract gestural information for the control parameters of digital audio effects is common practice. There has also been research using machine learning techniques to classify specific gestures based on audio feature analysis. In this paper, we describe our experiments in training a computer to map appropriate audio-based features onto sensor-like data, with the aim of potentially eliminating the need for sensors. Specifically, we present our experiments using the ESitar, a digitally enhanced, sensor-based controller modeled after the traditional North Indian sitar. We utilize multivariate linear regression to map continuous audio features to continuous gestural data.
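The regression itself is straightforward to sketch: a least-squares fit mapping frame-level audio feature vectors to synchronised sensor readings. The arrays below are generic placeholders, not the ESitar's actual features or sensor streams.

```python
import numpy as np

def fit_feature_to_sensor(features, sensor):
    """Least-squares fit of a linear map from audio features
    (n_frames x n_features) to synchronised sensor readings (n_frames,).
    Mirrors the multivariate linear regression the paper describes."""
    X = np.hstack([features, np.ones((len(features), 1))])  # bias column
    w, *_ = np.linalg.lstsq(X, sensor, rcond=None)
    return w

def predict_sensor(features, w):
    """Reconstruct sensor-like gestural data from audio features alone."""
    X = np.hstack([features, np.ones((len(features), 1))])
    return X @ w
```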