Music-Content-Adaptive Robust Principal Component Analysis for a Semantically Consistent Separation of Foreground and Background in Music Audio Signals

Robust Principal Component Analysis (RPCA) is a technique for decomposing signals into sparse and low-rank components, and has recently drawn the attention of the MIR field for the problem of separating leading vocals from accompaniment, with appealing results obtained on small excerpts of music. However, the performance of the method drops when processing entire music tracks. We present an adaptive formulation of RPCA that incorporates music content information to guide the decomposition. Experiments on a set of complete music tracks of various genres show that the proposed algorithm is better able to process entire pieces of music that may exhibit large variations in the music content, and that it compares favorably with the state of the art.
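As an illustration of the sparse-plus-low-rank split that RPCA performs, here is a minimal sketch using the inexact augmented Lagrange multiplier scheme with NumPy. It is a generic RPCA, not the authors' content-adaptive formulation; the parameter defaults (`lam`, `mu`, `rho`) are common heuristics, not values from the paper.

```python
import numpy as np

def rpca(M, lam=None, mu=None, rho=1.05, n_iter=100, tol=1e-7):
    """Decompose M ~ L + S with L low rank, S sparse (inexact ALM sketch)."""
    m, n = M.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(m, n))
    mu = mu if mu is not None else 0.25 * m * n / (np.abs(M).sum() + 1e-12)
    S = np.zeros_like(M)
    Y = np.zeros_like(M)
    for _ in range(n_iter):
        # Low-rank step: singular value soft thresholding
        U, s, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        L = (U * np.maximum(s - 1.0 / mu, 0.0)) @ Vt
        # Sparse step: elementwise soft thresholding
        R = M - L + Y / mu
        S = np.sign(R) * np.maximum(np.abs(R) - lam / mu, 0.0)
        # Dual update and penalty growth
        Y += mu * (M - L - S)
        mu *= rho
        if np.linalg.norm(M - L - S) <= tol * np.linalg.norm(M):
            break
    return L, S

# Toy data: a rank-2 "background" plus sparse "foreground" spikes
rng = np.random.default_rng(0)
low_rank = rng.standard_normal((40, 2)) @ rng.standard_normal((2, 60))
sparse = np.where(rng.random((40, 60)) < 0.05, 5.0, 0.0)
L, S = rpca(low_rank + sparse)
```

In the vocal-separation setting, `M` would be a magnitude spectrogram, with the accompaniment expected in `L` and the vocals in `S`.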
A Deep Learning Approach to the Prediction of Time-Frequency Spatial Parameters for Use in Stereo Upmixing

This paper presents a deep learning approach to parametric time-frequency parameter prediction for use within stereo upmixing algorithms. The approach uses a Multi-Channel U-Net with Residual connections (MuCh-Res-U-Net), trained on a novel dataset of stereo and parametric time-frequency spatial audio data, to predict time-frequency spatial parameters from a stereo input signal for positions on a 50-point Lebedev quadrature sampled sphere. An example upmix pipeline is then proposed that utilises the predicted time-frequency spatial parameters to extract stereo signal components and remap them to target spherical harmonic components, facilitating the generation of a full spherical representation of the upmixed sound field.
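The remapping of an extracted component to spherical harmonic channels can be illustrated with standard first-order ambisonic encoding. This sketch assumes per-bin direction angles as a stand-in for the network's predicted parameters; the MuCh-Res-U-Net and the 50-point Lebedev grid are not reproduced here.

```python
import numpy as np

def encode_foa(component, azimuth, elevation):
    """Pan a TF-domain component (frames x bins) to first-order ambisonic
    channels in ACN order (W, Y, Z, X) with SN3D normalization, using
    per-bin direction angles in radians."""
    gains = np.stack([
        np.ones_like(azimuth),                # W: omnidirectional
        np.sin(azimuth) * np.cos(elevation),  # Y
        np.sin(elevation),                    # Z
        np.cos(azimuth) * np.cos(elevation),  # X
    ])
    return gains * component[None, ...]

# A component panned straight ahead lands entirely in W and X
comp = np.ones((2, 3), dtype=complex)
sh = encode_foa(comp, np.zeros((2, 3)), np.zeros((2, 3)))
```

Higher-order components follow the same pattern with additional spherical harmonic gain terms.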
Real-Time Audio Visualization With Reassigned Non-uniform Filter Banks

Filter banks, both uniform and non-uniform, are widely used for signal analysis and processing. However, the application of a time-frequency localized filter inevitably causes some amount of spectral and temporal leakage, which cannot both be arbitrarily reduced at the same time. Reassignment is a classical procedure to eliminate this leakage in short-time Fourier spectrograms, thereby providing a sharper, more exact time-frequency domain signal representation. The reassignment technique was recently generalized to general filter banks, opening new possibilities for its application in signal analysis and processing. We present here the very first implementation of filter bank reassignment in a real-time analysis setting, more specifically as visualization in a basic audio player application. The visualization provides a low-delay moving spectrogram with respect to virtually any time-frequency filter bank by interfacing the C backend of the LTFAT open-source toolbox for time-frequency processing. Low delay is achieved by blockwise processing, implemented with the JUCE C++ library.
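For the special case of a short-time Fourier spectrogram, the classical reassignment step can be sketched with NumPy as follows, using the Auger-Flandrin offsets computed from STFTs taken with the window, its time-weighted version, and its (numerically approximated) derivative. Parameters are illustrative and nothing here is taken from the LTFAT implementation.

```python
import numpy as np

def stft(x, win, hop):
    n = len(win)
    frames = np.array([x[i:i + n] * win for i in range(0, len(x) - n + 1, hop)])
    return np.fft.rfft(frames, axis=1)

def reassigned_spectrogram(x, n=256, hop=64):
    """Magnitudes plus reassigned time (samples) and frequency
    (cycles/sample) for each STFT cell."""
    t = np.arange(n) - n / 2              # window-centered time axis
    h = np.hanning(n)
    X = stft(x, h, hop)                   # plain window
    Xt = stft(x, t * h, hop)              # time-weighted window
    Xd = stft(x, np.gradient(h), hop)     # approximate window derivative
    eps = np.finfo(float).eps
    dt = np.real(Xt / (X + eps))                    # time offset, samples
    df = -np.imag(Xd / (X + eps)) / (2 * np.pi)     # freq offset, cyc/sample
    frames, bins = X.shape
    times = np.arange(frames)[:, None] * hop + n / 2 + dt
    freqs = np.arange(bins)[None, :] / n + df
    return np.abs(X), times, freqs

# A sinusoid off the FFT grid: reassigned frequencies land near the true f0
f0 = 0.1234
x = np.cos(2 * np.pi * f0 * np.arange(2048))
mag, times, freqs = reassigned_spectrogram(x)
frame = mag.shape[0] // 2
peak = int(np.argmax(mag[frame]))
```

The filter bank generalization used in the paper replaces the FFT bins with arbitrary band-pass filters but keeps the same offset idea.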
An active learning procedure for the interaural time difference discrimination threshold

Measuring the auditory lateralization elicited by interaural time difference (ITD) cues involves the estimation of a psychometric function (PF). The shape of this function usually follows from the analysis of the subjective data and models the probability of correctly localizing the angular position of a sound source. The present study describes and evaluates a procedure for progressively fitting a PF, using Gaussian process classification of the subjective responses produced during a binary decision experiment. The process adaptively refines an approximate PF, following Bayesian inference. At each trial, it suggests the most informative auditory stimulus for function refinement according to Bayesian active learning by disagreement (BALD) mutual information. In this paper, the procedure was modified to accommodate two-alternative forced choice (2AFC) experimental methods and then was compared with a standard adaptive "three-down, one-up" staircase procedure. Our process approximates the average ITD threshold at the 79.4%-correct level of lateralization with a mean accuracy increase of 8.9% over the Weibull function fitted on the data of the same test. The final accuracy for the Just Noticeable Difference (JND) in ITD is achieved with only 37.6% of the trials needed by a standard lateralization test.
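The baseline "three-down, one-up" staircase can be sketched as a short simulation: the ITD is reduced after three consecutive correct responses and increased after each error, which drives the track toward the 79.4%-correct point. The listener model and step sizes below are invented for illustration, not taken from the study.

```python
import numpy as np

def staircase(respond, start=400.0, step=0.8, floor=10.0, n_trials=200):
    """respond(itd_us) -> bool (correct/incorrect); multiplicative steps."""
    itd, run, history = start, 0, []
    for _ in range(n_trials):
        history.append(itd)
        if respond(itd):
            run += 1
            if run == 3:                       # three correct in a row: harder
                itd, run = max(itd * step, floor), 0
        else:
            run = 0
            itd = itd / step                   # one error: easier
    return history

# Hypothetical listener whose 79.4%-correct point lies near 94 microseconds
rng = np.random.default_rng(1)
def listener(itd_us, thresh=100.0):
    p_correct = 0.5 + 0.5 * (1.0 - np.exp(-(itd_us / thresh) ** 2))
    return rng.random() < p_correct

history = staircase(listener)
```

Because a down-step requires three successes, the track stabilizes where p^3 = 0.5, i.e. p = 0.794; the paper's active-learning procedure instead picks each stimulus to maximize BALD mutual information.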
Musical Signal Analysis Using Fractional-Delay Inverse Comb Filters

A novel filter configuration for the analysis of harmonic musical signals is proposed. The method is based on inverse comb filtering, which allows for the extraction of selected harmonic components or the background noise component between the harmonic spectral components. The highly accurate delay required in the inverse comb filter is implemented with a high-order allpass filter. The paper shows that the filter is easy to design, efficient to implement, and enables accurate low-level feature analysis of musical tones. We describe several case studies to demonstrate the effectiveness of the proposed approach: isolating a single partial from a synthetic signal, analyzing the even-to-odd ratio of harmonics in a clarinet tone, and extracting the residual from a bowed string tone.
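The core inverse comb operation, y[n] = x[n] - x[n - D] with fractional delay D, can be sketched as follows. For brevity this sketch interpolates the fractional part linearly rather than with the high-order allpass filter the paper proposes, which sacrifices accuracy at high frequencies.

```python
import numpy as np

def inverse_comb(x, D):
    """y[n] = x[n] - x[n - D], with fractional D by linear interpolation.
    Notches fall at integer multiples of fs / D."""
    i, frac = int(np.floor(D)), D - np.floor(D)
    pad = lambda d: np.concatenate([np.zeros(d), x[:len(x) - d]])
    delayed = (1.0 - frac) * pad(i) + frac * pad(i + 1)
    return x - delayed

# A harmonic at f = 2/D sits in a notch and is cancelled; a component
# halfway between notches (f = 2.5/D) is boosted by up to 6 dB.
n = np.arange(8000)
on_notch = np.sin(2 * np.pi * (2 / 100.5) * n)
off_notch = np.sin(2 * np.pi * (2.5 / 100.5) * n)
y_on = inverse_comb(on_notch, 100.5)[1000:]    # skip the transient
y_off = inverse_comb(off_notch, 100.5)[1000:]
```

Subtracting the inverse comb output from the input conversely isolates the harmonic components, which is the basis of the extraction use cases above.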
RIR2FDN: An Improved Room Impulse Response Analysis and Synthesis

This paper seeks to improve the state of the art in delay-network-based analysis-synthesis of measured room impulse responses (RIRs). We propose an informed method incorporating improved energy decay estimation and synthesis with an optimized feedback delay network. The performance of the presented method is compared against an end-to-end deep-learning approach. A formal listening test was conducted where participants assessed the similarity of reverberated material across seven distinct RIRs and three different sound sources. The results reveal that the performance of these methods is influenced by both the excitation sounds and the reverberation conditions. Nonetheless, the proposed method consistently demonstrates higher similarity ratings compared to the end-to-end approach across most conditions. However, achieving an indistinguishable synthesis of measured RIRs remains a persistent challenge, underscoring the complexity of this problem. Overall, this work helps improve the sound quality of analysis-based artificial reverberation.
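A minimal feedback delay network of the kind used for synthesis here can be sketched with four delay lines coupled by an orthogonal (Hadamard) feedback matrix and a per-loop attenuation. Delay lengths and gain are placeholders; the paper's contribution lies in the informed energy-decay analysis that would set such parameters to match a measured RIR.

```python
import numpy as np

def fdn_ir(n_samples, delays=(149, 211, 263, 293), g=0.97):
    """Impulse response of a 4-line FDN with a Hadamard feedback matrix."""
    A = 0.5 * np.array([[1,  1,  1,  1],
                        [1, -1,  1, -1],
                        [1,  1, -1, -1],
                        [1, -1, -1,  1]])    # orthogonal: A @ A.T == I
    bufs = [np.zeros(d) for d in delays]     # circular delay-line buffers
    idx = [0, 0, 0, 0]
    out = np.zeros(n_samples)
    feed = 1.0                               # unit impulse into every line
    for n in range(n_samples):
        taps = np.array([bufs[k][idx[k]] for k in range(4)])
        out[n] = taps.sum()
        back = g * (A @ taps) + feed         # feedback plus input injection
        feed = 0.0
        for k in range(4):
            bufs[k][idx[k]] = back[k]
            idx[k] = (idx[k] + 1) % delays[k]
    return out

ir = fdn_ir(8000)
```

Since the feedback matrix is orthogonal and g < 1, the loop is stable and the response decays exponentially; frequency-dependent decay, as in measured RIRs, requires replacing the scalar g with per-line attenuation filters.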
Robustness and independence of voice timbre features under live performance acoustic degradations

Live performance situations can lead to degradations in the vocal signal from a typical microphone, such as ambient noise or echoes due to feedback. We investigate the robustness of continuous-valued timbre features measured on vocal signals (speech, singing, beatboxing) under simulated degradations. We also consider non-parametric dependencies between features, using information-theoretic measures and a feature-selection algorithm. We discuss how robustness and independence issues reflect on the choice of acoustic features for use in constructing a continuous-valued vocal timbre space. While some measures (notably spectral crest factors) emerge as good candidates for such a task, others are poor, and some features such as zero-crossing rate (ZCR) exhibit an interaction with the type of voice signal being analysed.
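Two of the features discussed, zero-crossing rate and spectral crest factor, are simple to compute; the sketch below shows full-band versions (the paper evaluates crest factors per sub-band).

```python
import numpy as np

def zcr(frame):
    """Zero-crossing rate: fraction of adjacent sample pairs changing sign."""
    return float(np.mean(np.signbit(frame[:-1]) != np.signbit(frame[1:])))

def spectral_crest(frame):
    """Peak-to-mean ratio of the magnitude spectrum: high for tonal
    frames, low for noise-like frames."""
    mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    return float(mag.max() / (mag.mean() + np.finfo(float).eps))

# A pure tone at 0.05 cycles/sample crosses zero twice per 20-sample period
n = np.arange(1024)
tone = np.sin(2 * np.pi * 0.05 * n)
noise = np.random.default_rng(0).standard_normal(1024)
```

The crest factor's relative insensitivity to added broadband noise (it tracks the dominant peak) is one intuition for why it holds up under the simulated degradations.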
GPGPU Audio Benchmark Framework

Acceleration of audio workloads on general-purpose GPU (GPGPU) hardware offers potentially high speedup factors, but also presents challenges in terms of development and deployment. We can increasingly depend on such hardware being available in users' systems, yet few real-time audio products use this resource. We propose a suite of benchmarks to qualify a GPU as suitable for batch or real-time audio processing. This includes both microbenchmarks and higher-level audio-domain benchmarks. We choose metrics based on the application, paying particularly close attention to the latency tail distribution. We propose an extension to the benchmark framework to more accurately simulate the real-world request pattern and performance requirements when running in a digital audio workstation. We run these benchmarks on two common consumer-level platforms: a PC desktop with a recent midrange discrete GPU and a Macintosh desktop with a unified CPU-GPU memory architecture.
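The latency-tail measurement central to such benchmarks can be sketched as follows: time a callback-like workload repeatedly and report upper percentiles, since real-time deadlines are violated by the worst calls rather than the average. The FFT workload and run count here are placeholders standing in for a GPU kernel dispatch.

```python
import time
import numpy as np

def benchmark(workload, n_runs=200):
    """Time repeated calls to workload() and return latency percentiles
    in seconds; the tail (p99, p99.9) matters for real-time deadlines."""
    lat = np.empty(n_runs)
    for i in range(n_runs):
        t0 = time.perf_counter()
        workload()
        lat[i] = time.perf_counter() - t0
    return {p: float(np.percentile(lat, p)) for p in (50, 95, 99, 99.9)}

# Placeholder workload: an FFT on an audio-callback-sized buffer
buf = np.random.default_rng(0).standard_normal(4096)
stats = benchmark(lambda: np.fft.rfft(buf))
```

A real-time audio device would compare the p99.9 figure against the buffer period (e.g. about 2.9 ms for 128 samples at 44.1 kHz) rather than the median.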
Sound Analysis and Synthesis Adaptive in Time and Two Frequency Bands

We present an algorithm for sound analysis and resynthesis with local automatic adaptation of time-frequency resolution. Several algorithms exist that adapt the analysis window depending on its time or frequency location; here we propose a method that selects the optimal resolution depending on both time and frequency. We consider an approach that we denote as analysis-weighting, from the point of view of Gabor frame theory. We analyze in particular the case of different adaptive time-varying resolutions within two complementary frequency bands; this is a typical case where perfect signal reconstruction cannot in general be achieved with fast algorithms, introducing an error that must be minimized. We provide examples of adaptive analyses of a musical sound, and outline several possibilities that this work opens.
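One common way to realize time-varying resolution selection, used here purely as an illustration, is to analyze each segment at several window lengths and keep the one minimizing a sparsity measure such as Rényi entropy. The paper's Gabor-frame analysis-weighting formulation, its two-band split, and its reconstruction-error minimization go beyond this sketch.

```python
import numpy as np

def renyi_entropy(mag2, alpha=3.0):
    """Rényi entropy of a normalized spectrogram; lower = more concentrated."""
    p = mag2 / (mag2.sum() + np.finfo(float).eps)
    return np.log2((p ** alpha).sum() + np.finfo(float).eps) / (1.0 - alpha)

def pick_window(segment, sizes=(128, 1024)):
    """Return the window length whose spectrogram of this segment is
    sparsest (lowest Rényi entropy)."""
    best, best_h = None, np.inf
    for n in sizes:
        hop = n // 2
        frames = np.array([segment[i:i + n] * np.hanning(n)
                           for i in range(0, len(segment) - n + 1, hop)])
        h = renyi_entropy(np.abs(np.fft.rfft(frames, axis=1)) ** 2)
        if h < best_h:
            best, best_h = n, h
    return best

# Stationary tones favor long windows; transients favor short ones
sine = np.sin(2 * np.pi * 0.1 * np.arange(2048))
click = np.zeros(2048)
click[1000] = 1.0
```

Applying such a selection independently per time segment and per frequency band gives resolutions varying in both dimensions, which is the regime the paper studies.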
On the window-disjoint-orthogonality of speech sources in reverberant humanoid scenarios

Many speech source separation approaches are based on the assumption of orthogonality of speech sources in the time-frequency domain. The target speech source is demixed from the mixture by applying the ideal binary mask to the mixture. The time-frequency orthogonality of speech sources has been investigated in detail only for anechoic and artificially mixed speech mixtures. This paper evaluates how the orthogonality of speech sources decreases when using a realistic reverberant humanoid recording setup and indicates strategies to enhance the separation capabilities of algorithms based on ideal binary masks under these conditions. It is shown that the SIR of the target source demixed from the mixture using the ideal binary mask decreases by approximately 3 dB for reverberation times of T60 = 0.6 s as opposed to the anechoic scenario. For humanoid setups, the spatial distribution of the sources and the choice of ear channel introduce differences of a further 3 dB in SIR, which leads to specific strategies for choosing the best channel for demixing.
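The ideal-binary-mask separation analyzed in the paper can be sketched as follows: a time-frequency cell is kept when the target's magnitude exceeds the interferer's there, and the SIR of the demixed signal follows from the masked energies. STFT parameters are illustrative; real evaluations would use speech and binaural room impulse responses rather than the disjoint test tones below.

```python
import numpy as np

def stft(x, n=512, hop=256):
    w = np.hanning(n)
    return np.array([np.fft.rfft(x[i:i + n] * w)
                     for i in range(0, len(x) - n + 1, hop)])

def ibm_sir(target, interferer):
    """SIR (dB) of the target demixed from target+interferer by the IBM."""
    T, I = stft(target), stft(interferer)
    mask = np.abs(T) > np.abs(I)            # ideal binary mask
    sig = np.sum(np.abs(T[mask]) ** 2)      # target energy kept by the mask
    err = np.sum(np.abs(I[mask]) ** 2)      # interferer energy leaking through
    return 10.0 * np.log10(sig / (err + 1e-12) + 1e-12)

# Two spectrally disjoint tones are nearly perfectly window-disjoint
s1 = np.sin(2 * np.pi * 0.1 * np.arange(4096))
s2 = np.sin(2 * np.pi * 0.3 * np.arange(4096))
```

Reverberation smears each source's energy across TF cells, increasing the overlap term `err` and producing the roughly 3 dB SIR loss the paper reports at T60 = 0.6 s.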