Multiple-F0 tracking based on a high-order HMM model
This paper addresses multiple-F0 tracking and the estimation of the number of harmonic source streams in music sound signals. A source stream is understood as being generated by a note played by a musical instrument. Each note is described by a hidden Markov model (HMM) with two states: the attack state and the sustain state. It is proposed to first track F0 candidates using a high-order HMM, based on a forward-backward dynamic programming scheme: propagated weights are calculated in the forward tracking stage, followed by an iterative tracking of the most likely trajectories in the backward tracking stage. The underlying source streams are then estimated by iteratively pruning the candidate trajectories in a maximum likelihood manner. The proposed system is evaluated on a specially constructed polyphonic music database. Compared with frame-based estimation systems, the tracking mechanism significantly improves the accuracy.
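For orientation, here is a minimal sketch of decoding with the two-state (attack/sustain) note model described above. It is a toy first-order Viterbi decoder, not the paper's high-order forward-backward scheme; all probabilities are invented for illustration.

```python
import numpy as np

# Illustrative sketch only: first-order Viterbi decoding of the two-state
# (attack/sustain) note model. The paper itself uses a high-order HMM with
# forward-backward dynamic programming, which this toy does not cover.
def viterbi(log_obs, log_trans, log_init):
    """log_obs: (T, S) frame log-likelihoods; log_trans: (S, S); log_init: (S,)."""
    T, S = log_obs.shape
    delta = np.empty((T, S))           # best log-probability ending in each state
    psi = np.zeros((T, S), dtype=int)  # backpointers
    delta[0] = log_init + log_obs[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans  # (from_state, to_state)
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_obs[t]
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path

# Toy note: state 0 = attack, state 1 = sustain; attacks are short and
# sustain rarely falls back to attack (hypothetical numbers).
log_trans = np.log([[0.3, 0.7], [0.05, 0.95]])
log_init = np.log([0.9, 0.1])
log_obs = np.log([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.2, 0.8], [0.2, 0.8]])
path = viterbi(log_obs, log_trans, log_init)  # attack frames, then sustain
```

With these toy likelihoods the decoder yields two attack frames followed by sustain frames, mirroring the intended note structure.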
Sound Analysis and Synthesis Adaptive in Time and Two Frequency Bands
We present an algorithm for sound analysis and resynthesis with local automatic adaptation of time-frequency resolution. Several existing algorithms adapt the analysis window depending on its time or frequency location; here we propose a method that selects the optimal resolution depending on both time and frequency. We consider an approach that we denote as analysis-weighting, from the point of view of Gabor frame theory. In particular, we analyze the case of different adaptive time-varying resolutions within two complementary frequency bands; this is a typical case where perfect signal reconstruction cannot in general be achieved with fast algorithms, introducing an error that has to be minimized. We provide examples of adaptive analyses of a music sound, and outline several possibilities that this work opens.
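The two-band idea can be illustrated with a deliberately simplified sketch (not the paper's Gabor-frame analysis-weighting algorithm): split the spectrum at an arbitrary frequency and analyze each band with its own window length, trading frequency resolution for time resolution per band. All window, hop and split values below are illustrative assumptions.

```python
import numpy as np

def stft(x, win_len, hop):
    """Plain Hann-windowed STFT, one rfft per frame."""
    w = np.hanning(win_len)
    frames = [x[i:i + win_len] * w
              for i in range(0, len(x) - win_len + 1, hop)]
    return np.array([np.fft.rfft(f) for f in frames])

def two_band_analysis(x, sr, split_hz=1000.0, long_win=2048, short_win=256):
    """Toy two-band analysis: long window below split_hz, short window above."""
    X = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), 1.0 / sr)
    low = np.fft.irfft(np.where(freqs < split_hz, X, 0), n=len(x))
    high = np.fft.irfft(np.where(freqs >= split_hz, X, 0), n=len(x))
    # long window -> fine frequency resolution for the low band,
    # short window -> fine time resolution for the high band
    return stft(low, long_win, long_win // 4), stft(high, short_win, short_win // 4)

x = np.sin(2 * np.pi * 200 * np.arange(8192) / 8000.0)
S_low, S_high = two_band_analysis(x, sr=8000)
```

Note that resynthesizing from two such independently windowed analyses is exactly where, as the abstract states, perfect reconstruction fails in general and an error must be minimized.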
On the Use of Perceptual Properties for Melody Estimation
This paper is about the use of perceptual principles for melody estimation. The melody stream is understood as generated by the most dominant source. Since the source with the strongest energy may not be perceptually the most dominant one, it is proposed to study three perceptual properties for melody estimation: loudness, masking effect and timbre similarity. The related criteria are integrated into a melody estimation system and their respective contributions are evaluated. The effectiveness of these perceptual criteria is confirmed by evaluation results on more than one hundred excerpts of music recordings.
Vivos Voco: A survey of recent research on voice transformations at IRCAM
IRCAM has long experience in the analysis, synthesis and transformation of voice. Natural voice transformations are of great interest for many applications and can be combined with text-to-speech systems, leading to a powerful creation tool. We present research conducted at IRCAM on voice transformations over the last few years. Transformations can be achieved in a global way by modifying pitch, spectral envelope, durations, etc. While this sacrifices the possibility of attaining a specific target voice, the approach allows the production of new voices of a high degree of naturalness with different gender and age, modified vocal quality, or another speech style. These transformations can be applied in real time using ircamTools TRAX. Transformations can also be performed in a more specific way in order to transform a voice towards the voice of a target speaker. Finally, we present some recent research on the transformation of expressivity.
Transforming Vibrato Extent in Monophonic Sounds
This paper describes research into signal transformation operators that modify the vibrato extent in recorded sound signals. A number of operators are proposed that deal with the problem at different levels of complexity. The experimental validation shows that the operators are effective in removing existing vibrato in real-world recordings, at least for the idealized case of long notes with properly segmented vibrato sections. It also shows that, for instruments with a significant noise level (e.g. the flute), independent treatment of the noise and harmonic signal components is required.
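The simplest conceivable vibrato-extent operator works on the F0 trajectory alone: smooth away the modulation to obtain the slow trend, then blend between the original and smoothed trajectories. This is a hypothetical sketch of that idea, not one of the paper's operators; function name and parameters are invented.

```python
import numpy as np

def scale_vibrato_extent(f0, frame_rate, extent=0.0, smooth_hz=2.0):
    """Scale the vibrato extent of an F0 trajectory (Hz per frame).

    extent=1 keeps the vibrato unchanged, 0 removes it, >1 exaggerates it.
    smooth_hz sets the cutoff of the (moving-average) trend estimator.
    """
    n = max(1, int(round(frame_rate / smooth_hz)) | 1)  # odd window length
    kernel = np.ones(n) / n
    pad = np.pad(f0, n // 2, mode='edge')
    mean_f0 = np.convolve(pad, kernel, mode='valid')    # slow F0 trend
    return mean_f0 + extent * (f0 - mean_f0)            # re-scaled modulation

# toy note: 3 s at 100 frames/s, 6 Hz vibrato of +/-10 Hz around 440 Hz
t = np.arange(300) / 100.0
f0 = 440.0 + 10.0 * np.sin(2 * np.pi * 6.0 * t)
flat = scale_vibrato_extent(f0, frame_rate=100.0, extent=0.0)
```

In a full operator the modified trajectory would then drive a resynthesis (e.g. with a phase vocoder); and, as the abstract notes, noisy instruments require treating the noise component separately.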
On Stretching Gaussian Noises with the Phase Vocoder
Recently, the processing of non-sinusoidal signals, or sound textures, has become an important topic in various areas. In general, such transformations are performed with phase vocoder techniques. Since the phase vocoder is based on a sinusoidal model, its performance is not satisfying when applied to sound textures. The following article investigates the problem using the most basic non-sinusoidal sounds, noise signals, as an example. We demonstrate the problems that arise when time stretching noise with the phase vocoder, describe some relevant statistical properties of the time-frequency representation of noise, and introduce an algorithm that preserves these statistical properties when time stretching noise with the phase vocoder. The resulting algorithm significantly improves the perceptual quality of time-stretched noise signals and is therefore seen as a promising first step towards an algorithm for the transformation of sound textures.
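For reference, here is a minimal classical phase-vocoder time stretch, i.e. the baseline whose artifacts on noise the article analyses (not the article's improved algorithm). Window and hop sizes are illustrative, and no overlap-add gain normalization is applied, to keep the sketch short.

```python
import numpy as np

def pv_stretch(x, rate, win_len=1024, hop_a=256):
    """Classical phase vocoder: analysis hop hop_a, synthesis hop hop_a*rate."""
    hop_s = int(round(hop_a * rate))
    w = np.hanning(win_len)
    bin_freqs = 2 * np.pi * np.arange(win_len // 2 + 1) / win_len
    n_frames = 1 + (len(x) - win_len) // hop_a
    y = np.zeros(win_len + (n_frames - 1) * hop_s)
    phase = np.zeros(win_len // 2 + 1)
    prev = None
    for m in range(n_frames):
        X = np.fft.rfft(w * x[m * hop_a: m * hop_a + win_len])
        if prev is None:
            phase = np.angle(X)
        else:
            # heterodyned phase increment -> instantaneous frequency
            dphi = np.angle(X) - np.angle(prev) - hop_a * bin_freqs
            dphi -= 2 * np.pi * np.round(dphi / (2 * np.pi))
            phase += hop_s * (bin_freqs + dphi / hop_a)
        prev = X
        frame = np.fft.irfft(np.abs(X) * np.exp(1j * phase))
        y[m * hop_s: m * hop_s + win_len] += w * frame  # windowed overlap-add
    return y

x = np.random.default_rng(1).standard_normal(8192)
y = pv_stretch(x, rate=2.0)
```

Run on noise input like this, the magnitude-preserving, phase-propagating assumption behind `phase += ...` is exactly what destroys the inter-bin phase statistics of noise, producing the metallic "phasiness" the article sets out to fix.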
Concatenative Sound Texture Synthesis Methods and Evaluation
Concatenative synthesis is a practical approach to sound texture synthesis because it preserves realistic short-time signal characteristics. In this article, we investigate three concatenative synthesis methods for sound textures: concatenative synthesis with descriptor controls (CSDC), Montage synthesis (MS) and a new method called AudioTexture (AT). The respective algorithms are presented, focusing on the identification and selection of concatenation units. The evaluation demonstrates that the presented algorithms perform comparably in terms of quality and similarity with respect to the reference original sounds.
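At the core of any such method sits a unit-selection step. The following is a generic, hypothetical sketch of that step (greedy nearest-neighbour selection over descriptor vectors with a simple concatenation cost), not the actual CSDC, MS or AT selection algorithms.

```python
import numpy as np

def select_units(target_desc, corpus_desc, continuity=0.5):
    """Greedy unit selection.

    target_desc: (T, D) per-frame target descriptors; corpus_desc: (N, D)
    descriptors of the corpus units. The continuity term rewards picking the
    unit that directly follows the previously selected one in the corpus,
    which favours concatenating longer contiguous chunks.
    """
    T = len(target_desc)
    sel = np.zeros(T, dtype=int)
    for t in range(T):
        cost = np.linalg.norm(corpus_desc - target_desc[t], axis=1)
        if t > 0:  # penalize every unit except the previous unit's successor
            cost += continuity * (np.arange(len(corpus_desc)) != sel[t - 1] + 1)
        sel[t] = cost.argmin()
    return sel

# toy corpus of four 1-D descriptor units and a three-frame target
corpus = np.array([[0.0], [1.0], [2.0], [3.0]])
target = np.array([[0.0], [1.0], [2.0]])
sel = select_units(target, corpus, continuity=0.1)
```

Real systems replace the greedy loop with dynamic programming over target and concatenation costs, but the trade-off shown here (descriptor match vs. continuity) is the same one the three methods negotiate.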
Gradient Conversion Between Time and Frequency Domains Using Wirtinger Calculus
Gradient-based optimizations are commonly found in areas where Fourier transforms are used, such as audio signal processing. This paper presents a new method for converting any gradient of a cost function with respect to a signal into, or from, a gradient with respect to the spectrum of this signal: thus, it allows gradient descent to be performed indiscriminately in the time or the frequency domain. For efficiency purposes, and because the gradient of a real function with respect to a complex signal does not formally exist, this work is carried out using Wirtinger calculus.
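A small numerical illustration of such a conversion (a sketch under simplifying assumptions, not the paper's formulation): for a real signal x with spectrum X = fft(x) and a cost L(x) = Σₖ wₖ|Xₖ|², the Wirtinger derivative is ∂L/∂X̄ₖ = wₖXₖ, and the time-domain gradient is recovered through the inverse DFT as grad_x L = 2N·Re(ifft(w·X)). A finite-difference check confirms the conversion.

```python
import numpy as np

def freq_to_time_grad(grad_wirtinger, n):
    """Map dL/dconj(X) (Wirtinger gradient w.r.t. the spectrum) to dL/dx
    for a real signal x of length n, under numpy's unnormalized fft."""
    return 2 * n * np.real(np.fft.ifft(grad_wirtinger))

rng = np.random.default_rng(0)
n = 64
x = rng.standard_normal(n)
w = rng.random(n)                      # arbitrary per-bin weights
X = np.fft.fft(x)
grad = freq_to_time_grad(w * X, n)     # converted time-domain gradient

# brute-force finite-difference gradient of L(x) = sum_k w_k |X_k|^2
def L(x):
    return np.sum(w * np.abs(np.fft.fft(x)) ** 2)

eps = 1e-6
fd = np.array([(L(x + eps * np.eye(n)[i]) - L(x - eps * np.eye(n)[i])) / (2 * eps)
               for i in range(n)])
```

The factor 2N comes from numpy's unnormalized DFT convention and from x being real; other conventions (unitary DFT, complex x) change the constant and the Re(·), which is precisely the kind of bookkeeping the paper systematizes.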
Automatic drum transcription with convolutional neural networks
Automatic drum transcription (ADT) aims to detect drum events in polyphonic music. This task is part of the more general problem of transcribing a music signal into its musical score, and is also very useful for extracting high-level information such as tempo, downbeats and measures. This article investigates the use of Convolutional Neural Networks (CNN) for ADT. Two strategies are compared. In the first, CNN-based detection of drum-only onsets is combined with an algorithm using Non-negative Matrix Deconvolution (NMD) for drum onset transcription. The second relies entirely on CNNs for the detection of the individual drum instruments. The questions of which loss function is best adapted to this task and of the optimal input structure are also investigated. All algorithms are evaluated on the publicly available ENST Drum database, a widely used reference dataset, allowing easy comparison with other algorithms. The comparison shows that the purely CNN-based algorithm significantly outperforms the NMD-based approach; compared with the best previously published results, which also use a neural network model, its results are significantly better for the snare drum but slightly worse for the bass drum and the hi-hat.
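Whatever the front end (CNN or NMD), both strategies end with the same decision stage: a per-frame, per-instrument activation curve from which discrete drum events are extracted by peak picking. A minimal, hypothetical version of that stage (threshold and gap values invented):

```python
import numpy as np

def pick_onsets(activation, threshold=0.5, min_gap=3):
    """Return frame indices of detected onsets: local maxima of the
    activation curve above `threshold`, at least `min_gap` frames apart."""
    onsets = []
    for t in range(1, len(activation) - 1):
        if (activation[t] >= threshold
                and activation[t] >= activation[t - 1]   # rising or flat into t
                and activation[t] > activation[t + 1]    # falling after t
                and (not onsets or t - onsets[-1] >= min_gap)):
            onsets.append(t)
    return onsets

# toy activation with two clear peaks (e.g. two bass-drum hits)
act = np.array([0.0, 0.1, 0.9, 0.2, 0.0, 0.1, 0.8, 0.1, 0.0])
hits = pick_onsets(act)
```

Evaluation then matches these detected frames against ground-truth annotations within a tolerance window, which is how the per-instrument F-measures on the ENST Drum database are obtained.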
Sound texture synthesis using Convolutional Neural Networks
The following article introduces a new parametric synthesis algorithm for sound textures, inspired by existing methods for visual textures. Using a 2D Convolutional Neural Network (CNN), a sound signal is modified until the temporal cross-correlations of the feature maps of its log-spectrogram resemble those of a target texture. We show that the resulting synthesized sound is both different from the original and of high quality, while reproducing singular events appearing in the original. The process is performed in the time domain, avoiding the harmful phase-recovery step that usually concludes synthesis performed in the time-frequency domain. It is also straightforward and flexible, as it does not require any fine-tuning between several losses when synthesizing diverse sound textures. Synthesized spectrograms and sound signals are showcased, and a way of extending the synthesis to produce a sound of any length is also presented. We also discuss the choice of CNN, border effects in our synthesized signals, and possible ways of reducing the algorithm's currently long computation time.
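The statistic driving the optimization can be sketched compactly: given CNN feature maps over time, form their temporal cross-correlation (Gram) matrix and penalize its distance to the target's. This numpy sketch shows only the loss computation, with random arrays standing in for actual CNN feature maps; in the article the gradient of this loss is backpropagated through the CNN and the log-spectrogram to the time-domain signal.

```python
import numpy as np

def temporal_gram(features):
    """features: (C, T) feature maps; returns the (C, C) matrix of
    temporal cross-correlations between channels, normalized by T."""
    C, T = features.shape
    return features @ features.T / T

def texture_loss(feat_synth, feat_target):
    """Squared Frobenius distance between the two Gram matrices."""
    G_s, G_t = temporal_gram(feat_synth), temporal_gram(feat_target)
    return np.sum((G_s - G_t) ** 2)

# stand-ins for feature maps of the synthesized and target log-spectrograms
feat = np.random.default_rng(2).standard_normal((4, 100))
G = temporal_gram(feat)
```

Because the Gram matrix discards the absolute time axis while keeping co-activation statistics, minimizing this loss reproduces the texture's feel without copying the target frame by frame.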