Analysis and Trans-synthesis of Acoustic Bowed-String Instrument Recordings: A Case Study Using Bach Cello Suites
In this paper, the analysis and trans-synthesis of acoustic bowed-string instrument recordings with a new non-negative matrix factorization (NMF) procedure are presented. This work shows that more than one template may be required to represent a single note, owing to the time-varying timbre characteristic of bowed-string instruments. The proposed method improves on the original NMF by requiring no advance knowledge of tone models or of the number of templates. The resulting NMF information is then converted into synthesis parameters for sinusoidal synthesis. Bach cello suites recorded by Fournier and Starker are used in the experiments, and analysis and trans-synthesis examples of the recordings are provided.
Index Terms—trans-synthesis, non-negative matrix factorization, bowed string instrument
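As a rough illustration of the decomposition this abstract builds on, the sketch below implements the standard NMF multiplicative updates on a magnitude spectrogram. The paper's extension, which adapts the number of templates per note, is not reproduced; all names are illustrative.

```python
# A minimal sketch of standard NMF (Lee & Seung multiplicative updates),
# the baseline the paper extends. Not the paper's adaptive-template method.
import numpy as np

def nmf(V, n_templates, n_iter=200, eps=1e-9):
    """Factor a magnitude spectrogram V (freq x time) as V ~= W @ H."""
    n_freq, n_frames = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((n_freq, n_templates)) + eps    # spectral templates
    H = rng.random((n_templates, n_frames)) + eps  # time-varying gains
    for _ in range(n_iter):
        # Multiplicative updates minimising the Euclidean distance ||V - WH||
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

Applied to the magnitude STFT of a recording, the columns of W act as note templates and the rows of H as their gain envelopes, which is the kind of information the paper converts into sinusoidal-synthesis parameters.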
Drumkit Transcription via Convolutive NMF
Audio-to-MIDI software exists for transcribing the output of a multi-mic'ed drum kit. Such software requires that the drummer use multiple microphones so that a single stream of audio is captured for each kit piece. This paper explores the first steps towards a system for transcribing a drum score from the input of a single mono microphone. Non-negative matrix factorisation is a widely researched source separation technique; we describe a system for transcribing drums with this technique and present an improved gains update method. A good level of accuracy is achieved on complex loops, and there are indications that the mis-transcriptions concern perceptually less important parts of the score.
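For context, a minimal sketch of convolutive NMF with the baseline multiplicative updates follows. The abstract does not specify the improved gains (H) update, so only the standard one is shown; all names are assumptions.

```python
# Sketch of convolutive NMF (Smaragdis-style): V ~= sum_t W[t] @ shift(H, t),
# with the baseline Euclidean multiplicative updates, not the paper's
# improved gains update.
import numpy as np

def shift(X, t):
    """Shift the columns of X by t frames (right for t > 0, left for t < 0)."""
    out = np.zeros_like(X)
    if t == 0:
        out[:] = X
    elif t > 0:
        out[:, t:] = X[:, :-t]
    else:
        out[:, :t] = X[:, -t:]
    return out

def cnmf(V, n_templates, T=8, n_iter=100, eps=1e-9):
    """Each template is a T-frame spectro-temporal pattern (e.g. one drum hit)."""
    n_freq, n_frames = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((T, n_freq, n_templates)) + eps
    H = rng.random((n_templates, n_frames)) + eps
    for _ in range(n_iter):
        V_hat = sum(W[t] @ shift(H, t) for t in range(T)) + eps
        # Baseline gains update: accumulate contributions from every lag t
        num = sum(W[t].T @ shift(V, -t) for t in range(T))
        den = sum(W[t].T @ shift(V_hat, -t) for t in range(T)) + eps
        H *= num / den
        V_hat = sum(W[t] @ shift(H, t) for t in range(T)) + eps
        for t in range(T):
            W[t] *= (V @ shift(H, t).T) / (V_hat @ shift(H, t).T + eps)
    return W, H
```

Peaks in the rows of H then indicate candidate drum-hit times for the score.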
Sound Source Separation: Azimuth Discrimination and Resynthesis
In this paper we present a novel sound source separation algorithm that requires no prior knowledge and no learning, assisted or otherwise, and performs the separation based purely on azimuth discrimination within the stereo field. The algorithm exploits the use of the pan pot as a means to achieve image localisation within stereophonic recordings. As such, only an interaural intensity difference exists between left and right channels for a single source. We use gain scaling and phase cancellation techniques to expose frequency-dependent nulls across the azimuth domain, from which source separation and resynthesis are carried out. We present results obtained from real recordings and show that, for musical recordings, the algorithm improves upon the output quality of current source separation schemes.
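A minimal single-frame sketch of the frequency-azimuth idea described above, assuming complex STFT spectra L and R of the two channels; the full algorithm (both halves of the stereo field, bin selection, inverse STFT and overlap-add) is omitted, and all names are illustrative.

```python
import numpy as np

def adress_frame(L, R, n_az=64):
    """Frequency-azimuth plane for one STFT frame (ADRess-style sketch).

    Subtracting a gain-scaled copy of one channel from the other cancels a
    source exactly at the gain that matches its pan position, producing a
    frequency-dependent null. This covers the right-dominant half of the
    field; swapping L and R covers the other half.
    """
    g = np.linspace(0.0, 1.0, n_az)                    # gain grid over azimuths
    az = np.abs(L[:, None] - g[None, :] * R[:, None])  # phase cancellation
    null_idx = az.argmin(axis=1)                       # apparent azimuth per bin
    null_depth = az.max(axis=1) - az.min(axis=1)       # magnitude estimate at the null
    return null_idx, null_depth

# Bins whose null index falls within a chosen azimuth window are kept (with
# magnitude null_depth and the mixture phase) and resynthesised; the rest
# are zeroed.
```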
Effective Singing Voice Detection in Popular Music Using ARMA Filtering
Locating singing voice segments is essential for the convenient indexing, browsing and retrieval of large music archives and catalogues, and it is also beneficial for automatic music transcription and annotation. The approach described in this paper uses Mel-frequency cepstral coefficients in conjunction with Gaussian mixture models to discriminate between two classes of data: instrumental music, and singing voice with musical background. Because classification is imperfect, the categorization tends to alternate within very short time spans unless post-processing is applied, whereas singing voice tends to be continuous over several frames. Various tests have therefore been performed to identify a suitable decision function and corresponding smoothing methods. Results are reported by comparing straightforward likelihood-based classification against post-processing with an autoregressive moving average (ARMA) filter.
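A plausible skeleton of such a pipeline, assuming librosa, scikit-learn and SciPy; the filter coefficients, model sizes and function names here are illustrative, not the paper's.

```python
import numpy as np
import librosa
from scipy.signal import lfilter
from sklearn.mixture import GaussianMixture

def frame_llr(y, sr, gmm_vocal, gmm_music):
    """Per-frame log-likelihood ratio between the two classes.

    gmm_vocal / gmm_music are GaussianMixture models fitted beforehand on
    MFCC frames of each class, e.g. GaussianMixture(n_components=32).fit(X).
    """
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T  # (frames, coeffs)
    return gmm_vocal.score_samples(mfcc) - gmm_music.score_samples(mfcc)

def smooth_and_decide(llr, b=(0.3, 0.3), a=(1.0, -0.4)):
    """ARMA-filter the raw likelihood ratio before thresholding, so the
    decision cannot flip within a very short time span; the coefficients
    are placeholders chosen for unit DC gain."""
    return lfilter(b, a, llr) > 0.0  # True = singing voice present
```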
The REACTION System: Automatic Sound Segmentation and Word Spotting for Verbal Reaction Time Tests
Reaction tests are typical tests from psychological research and communication science in which a test person is presented with a stimulus such as a photo, a sound, or written words. The individual has to evaluate the stimulus as fast as possible in a predefined manner and react by presenting the result of the evaluation: by pushing a button in simple reaction tests, or by saying an answer in verbal reaction tests. The reaction time between the onset of the stimulus and the onset of the response can be used as a measure of the difficulty of the given evaluation. Compared to simple reaction tests, verbal reaction tests are very powerful, since the individual can simply say the answer, which is the most natural way of answering. The drawback of verbal reaction tests is that the reaction times still have to be determined manually: a person has to listen through all audio recordings taken during the test sessions and mark stimulus times and word beginnings one by one, which is very time-consuming and labor-intensive. To replace this manual evaluation, this article presents the REACTION (Reaction Time Determination) system, which automatically determines the reaction times of a test session by analyzing its audio recording. The system automatically detects the onsets of stimuli as well as the onsets of answers. The recording is furthermore segmented into parts, each containing one stimulus and the following reaction, which further facilitates the transcription of the spoken words for semantic evaluation.
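The abstract does not detail the onset detectors, so the following is only a naive energy-based stand-in that makes the reaction-time computation concrete; thresholds and names are assumptions.

```python
import numpy as np

def energy_onsets(y, sr, frame=1024, hop=512, thresh_db=-30.0, min_gap=0.5):
    """Naive onset picker: report times where the short-time level first
    rises above a threshold after at least min_gap seconds of quiet.
    A placeholder for the stimulus/answer detectors in REACTION."""
    n = 1 + (len(y) - frame) // hop
    rms = np.array([np.sqrt(np.mean(y[i*hop:i*hop+frame]**2)) for i in range(n)])
    level = 20 * np.log10(rms + 1e-12)
    onsets, last_loud = [], -min_gap
    for i, l in enumerate(level):
        t = i * hop / sr
        if l > thresh_db:
            if t - last_loud >= min_gap:   # first loud frame after a gap
                onsets.append(t)
            last_loud = t
    return onsets

# Reaction time = onset of the answer minus onset of the preceding stimulus.
```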
Identifying function-specific prosodic cues for non-speech user interface sound design
This study explores the potential of utilising certain prosodic qualities of function-specific vocal expressions in order to design effective non-speech user interface sounds. In an empirical setting, utterances with four context-situated communicative functions were produced by 20 participants. Time series of fundamental frequency (F0) and intensity were extracted from the utterances and analysed statistically. The results show that individual communicative functions have distinct prosodic characteristics that can be statistically modelled. By using the model, certain function-specific prosodic cues can be identified and, in turn, imitated in the design of communicative interface sounds for the corresponding communicative functions in human-computer interaction.
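One possible way to obtain the two analysed time series, using librosa's pYIN pitch tracker and RMS level as stand-ins for whatever extraction tools the study actually used:

```python
import numpy as np
import librosa

def prosody_series(y, sr):
    """Extract the two series the study analyses statistically:
    fundamental frequency (F0, NaN in unvoiced frames) and intensity."""
    f0, voiced_flag, voiced_prob = librosa.pyin(
        y, fmin=librosa.note_to_hz('C2'), fmax=librosa.note_to_hz('C6'), sr=sr)
    rms = librosa.feature.rms(y=y)[0]
    intensity_db = 20 * np.log10(rms + 1e-12)  # RMS level as intensity proxy
    return f0, intensity_db
```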
Singing Voice Separation Based on Non-Vocal Independent Component Subtraction and Amplitude Discrimination
Many applications of Music Information Retrieval can benefit from effective isolation of the music sources. Earlier work by the authors led to the development of a system, based on Azimuth Discrimination and Resynthesis (ADRess), that can extract the singing voice from reverberant stereophonic mixtures. We propose an extension of our previous method that is not based on ADRess and exploits both channels of the stereo mix more effectively. For the evaluation of the system we use a dataset containing songs to which convolution (reverberation) was applied during mastering as well as during mixing (i.e. "real-world" conditions). The metrics for objective evaluation are based on bss_eval.
Model-based synthesis and transformation of voiced sounds
In this work a glottal model loosely based on the Ishizaka and Flanagan model is proposed, in which the number of parameters is drastically reduced. First, the glottal excitation waveform is estimated, together with the vocal tract filter parameters, using inverse filtering techniques. The estimated waveform is then used to identify the nonlinear glottal model, represented by a closed-loop configuration of two blocks: a second-order resonant filter, tuned with respect to the signal pitch, and a regressor-based functional whose coefficients are estimated via nonlinear identification techniques. The results show that an accurate identification of real data can be achieved with fewer than 10 regressors of the nonlinear functional, and that intuitive control of fundamental features, such as pitch and intensity, is possible by acting on the physically informed parameters of the model.
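The sketch below reproduces only the block structure described above, a pitch-tuned second-order resonator in closed loop with a polynomial ("regressor-based") feedback term; the coefficient of the nonlinearity is an arbitrary placeholder, whereas in the paper the functional's coefficients are obtained via nonlinear identification.

```python
import numpy as np

def glottal_sketch(dur, f0, sr=16000, r=0.95, c3=-0.1, amp=0.05):
    """Structural sketch of the closed loop: resonator + polynomial feedback,
    driven by an impulse train. Pitch and intensity are controlled directly
    by f0 and amp, echoing the paper's physically informed parameters.
    c3 is a placeholder, chosen negative so the cubic term is dissipative."""
    w = 2 * np.pi * f0 / sr
    a1, a2 = -2 * r * np.cos(w), r * r       # resonator denominator coefficients
    period = max(1, int(round(sr / f0)))
    y = np.zeros(int(dur * sr))
    x1 = x2 = 0.0
    for n in range(len(y)):
        u = amp if n % period == 0 else 0.0  # glottal drive (impulse train)
        fb = c3 * x1 ** 3                    # regressor-based nonlinear feedback
        x0 = u + fb - a1 * x1 - a2 * x2      # second-order resonant filter update
        y[n] = x0
        x2, x1 = x1, x0
    return y
```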
Separation of Speech Signal from Complex Auditory Scenes
The hearing system, even when faced with complex auditory scenes and unfavourable conditions, is able to separate and recognize auditory events accurately. A great deal of effort has gone into understanding how the human auditory system processes acoustical data after capturing them. The aim of this work is the digital implementation of the decomposition of a complex sound into separate parts as they would appear to a listener, an operation called signal separation. In this work, the separation of the speech signal from complex auditory scenes has been studied, and the techniques that address this problem have been evaluated experimentally.
A New Criterion and Associated Bit Allocation Method for Current Audio Coding Standards
This paper presents a new noise-shaping criterion and, based on it, derives an efficient bit allocation method applicable to current audio standards such as MPEG-1 Layer 3 and MPEG-4 AAC. The method achieves a speed-up of more than ten times and better quality compared with the traditional two-nested-loop method presented in the ISO draft. The experiments confirm the correctness of the objective measurement criterion and show that the new allocation, being deterministic rather than iterative, achieves high allocation efficiency and the best quality.
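Since the paper's criterion is not given in the abstract, the sketch below shows only a generic deterministic one-pass allocation, for contrast with the iterative two-nested-loop approach; the allocation rule and all names are illustrative.

```python
import numpy as np

def one_pass_allocation(band_energy, mask_thresh, total_bits, eps=1e-12):
    """Generic deterministic bit allocation sketch: each scalefactor band
    asks for bits in proportion to its log signal-to-mask ratio, and the
    demands are rescaled once to fit the bit budget, with no inner/outer
    iteration. Not the paper's actual criterion."""
    smr = np.maximum(np.log2((band_energy + eps) / (mask_thresh + eps)), 0.0)
    if smr.sum() <= 0:
        return np.zeros_like(smr, dtype=int)
    bits = smr * (total_bits / smr.sum())  # rescale demands to the budget
    return np.floor(bits).astype(int)
```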