Download Improving intelligibility prediction under informational masking using an auditory saliency model The reduction of speech intelligibility in noise is usually dominated by energetic masking (EM) and informational masking (IM). Most state-of-the-art objective intelligibility measures (OIM) estimate intelligibility by quantifying EM. Few measures model the effect of IM in detail. In this study, an auditory saliency model, which intends to measure the probability of the sources obtaining auditory attention in a bottom-up process, was integrated into an OIM for improving the performance of intelligibility prediction under IM. While EM is accounted for by the original OIM, IM is assumed to arise from the listener’s attention switching between the target and competing sounds existing in the auditory scene. The performance of the proposed method was evaluated along with three reference OIMs by comparing the model predictions to the listener word recognition rates, for different noise maskers, some of which introduce IM. The results shows that the predictive accuracy of the proposed method is as good as the best reported in the literature. The proposed method, however, provides a physiologically-plausible possibility for both IM and EM modelling.
Download Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics Timbre spaces have been used in music perception to study the perceptual relationships between instruments based on dissimilarity ratings. However, these spaces do not generalize to novel examples and do not provide an invertible mapping, preventing audio synthesis. In parallel, generative models have aimed to provide methods for synthesizing novel timbres. However, these systems do not provide an understanding of their inner workings and are usually not related to any perceptually relevant information. Here, we show that Variational Auto-Encoders (VAE) can alleviate all of these limitations by constructing generative timbre spaces. To do so, we adapt VAEs to learn an audio latent space, while using perceptual ratings from timbre studies to regularize the organization of this space. The resulting space allows us to analyze novel instruments, while being able to synthesize audio from any point of this space. We introduce a specific regularization allowing to enforce any given similarity distances onto these spaces. We show that the resulting space provide almost similar distance relationships as timbre spaces. We evaluate several spectral transforms and show that the Non-Stationary Gabor Transform (NSGT) provides the highest correlation to timbre spaces and the best quality of synthesis. Furthermore, we show that these spaces can generalize to novel instruments and can generate any path between instruments to understand their timbre relationships. As these spaces are continuous, we study how audio descriptors behave along the latent dimensions. We show that even though descriptors have an overall non-linear topology, they follow a locally smooth evolution. Based on this, we introduce a method for descriptor-based synthesis and show that we can control the descriptors of an instrument while keeping its timbre structure.
Download Hard real-time onset detection of percussive instruments To date, the most successful onset detectors are those based on frequency representation of the signal. However, for such methods the time between the physical onset and the reported one is unpredictable and may largely vary according to the type of sound being analyzed. Such variability and unpredictability of spectrum-based onset detectors may not be convenient in some real-time applications. This paper proposes a real-time method to improve the temporal accuracy of state-of-the-art onset detectors. The method is grounded on the theory of hard real-time operating systems where the result of a task must be reported at a certain deadline. It consists of the combination of a time-base technique (which has a high degree of accuracy in detecting the physical onset time but is more prone to false positives and false negatives) with a spectrum-based technique (which has a high detection accuracy but a low temporal accuracy). The developed hard real-time onset detector was tested on a dataset of single non-pitched percussive sounds using the high frequency content detector as spectral technique. Experimental validation showed that the proposed approach was effective in better retrieving the physical onset time of about 50% of the hits detected by the spectral technique, with an average improvement of about 3 ms and maximum one of about 12 ms. The results also revealed that the use of a longer deadline may capture better the variability of the spectral technique, but at the cost of a bigger latency.