Download Real-Time System for Sound Enhancement in Noisy Environment
The noise can affect the listening experience in many real-life situations involving loudspeakers as a playback device. A solution to reduce the effect of the noise is to employ headphones, but they can be annoying and not allowed on some occasions. In this context, a system for improving the audio perception and the intelligibility of sounds in a domestic noisy environment is introduced and a real-time implementation is proposed. The system comprises three main blocks: a noise estimation procedure based on an adaptive algorithm, an auditory spectral masking algorithm that estimates the music threshold capable of masking the noise source, and an FFT equalizer that is used to apply the estimated level. It has been developed on an embedded DSP board considering one microphone for the ambient noise analysis and two vibrating sound transducers for sound reproduction. Several experiments on simulated and real-world scenarios have been realized to prove the effectiveness of the proposed approach.
Download Differentiable All-Pole Filters for Time-Varying Audio Systems
Infinite impulse response filters are an essential building block of many time-varying audio systems, such as audio effects and synthesisers. However, their recursive structure impedes end-toend training of these systems using automatic differentiation. Although non-recursive filter approximations like frequency sampling and frame-based processing have been proposed and widely used in previous works, they cannot accurately reflect the gradient of the original system. We alleviate this difficulty by reexpressing a time-varying all-pole filter to backpropagate the gradients through itself, so the filter implementation is not bound to the technical limitations of automatic differentiation frameworks. This implementation can be employed within audio systems containing filters with poles for efficient gradient evaluation. We demonstrate its training efficiency and expressive capabilities for modelling real-world dynamic audio systems on a phaser, time-varying subtractive synthesiser, and feed-forward compressor. We make our code and audio samples available and provide the trained audio effect and synth models in a VST plugin1 .
Download Empirical Results for Adjusting Truncated Backpropagation Through Time While Training Neural Audio Effects
This paper investigates the optimization of Truncated Backpropagation Through Time (TBPTT) for training neural networks in digital audio effect modeling, with a focus on dynamic range compression. The study evaluates key TBPTT hyperparameters – sequence number, batch size, and sequence length – and their influence on model performance. Using a convolutional-recurrent architecture, we conduct extensive experiments across datasets with and without conditioning by user controls. Results demonstrate that carefully tuning these parameters enhances model accuracy and training stability, while also reducing computational demands. Objective evaluations confirm improved performance with optimized settings, while subjective listening tests indicate that the revised TBPTT configuration maintains high perceptual quality.
Download Evaluating the Performance of Objective Audio Quality Metrics in Response to Common Audio Degradations
This study evaluates the performance of five objective audio quality metrics—PEAQ Basic, PEAQ Advanced, PEMO-Q, ViSQOL, and HAAQI —in the context of digital music production. Unlike previous comparisons, we focus on their suitability for production environments, an area currently underexplored in existing research. Twelve audio examples were tested using two evaluation types: an effectiveness test under progressively increasing degradations (hum, hiss, clipping, glitches) and a robustness test under fixed-level, randomly fluctuating degradations. In the effectiveness test, HAAQI, PEMO-Q, and PEAQ Basic effectively tracked degradation changes, while PEAQ Advanced failed consistently and ViSQOL showed low sensitivity to hum and glitches. In the robustness test, ViSQOL and HAAQI demonstrated the highest consistency, with average standard deviations of 0.004 and 0.007, respectively, followed by PEMO-Q (0.021), PEAQ Basic (0.057), and PEAQ Advanced (0.065). However, ViSQOL also showed low variability across audio examples, suggesting limited genre sensitivity. These findings highlight the strengths and limitations of each metric for music production, specifically quality measurement with compressed audio. The source code and dataset will be made publicly available upon publication.
Download Analysis of Musical Dynamics in Vocal Performances Using Loudness Measures
In addition to tone, pitch and rhythm, dynamics is one of the expressive dimensions of the performance of a music piece that has received limited attention. While the usage of dynamics may vary from artist to artist, and also from performance to performance, a systematic methodology to automatically identify the dynamics of a performance in terms of musically meaningful terms like forte, piano may offer valuable feedback in the context of music education and in particular in singing. To this end, we have manually annotated the dynamic markings of commercial recordings of popular rock and pop songs from the Smule Vocal Balanced (SVB) dataset which will be used as reference data. Then as a first step for our research goal, we propose a method to derive and compare singing voice loudness curves in polyphonic mixtures. Towards measuring the similarity and variation of dynamics, we compare the dynamics curves of the SVB renditions with the one derived from the original songs. We perform the same comparison using professionally produced renditions from a karaoke website. We relate high values of Spearman correlation coefficient found in some select student renditions and the professional renditions with accurate dynamics.
Download HD-AD: A New Approach to Audio Atomic Decomposition with Hyperdimensional Computing
In this paper, we approach the problem of atomic decomposition of audio at the symbolic level of atom parameters through the lens of hyperdimensional computing (HDC) – a non-traditional computing paradigm. Existing atomic decomposition algorithms often operate using waveforms from a redundant dictionary of atoms causing them to become increasingly memory/computationally intensive as the signal length grows and/or the atoms become more complicated. We systematically build an atom encoding using vector function architecture (VFA), a field of HDC. We train a neural network encoder on synthetic audio signals to generate these encodings and observe that the network can generalize to real recordings. This system, we call Hyperdimensional Atomic Decomposition (HD-AD), avoids time-domain correlations all together. Because HD-AD scales with the sparsity of the signal, rather than its length in time, atomic decompositions are often produced much faster than real-time.
Download A Low-Latency Quasi-Linear-Phase Octave Graphic Equalizer
This paper proposes a low-latency quasi-linear-phase octave graphic equalizer. The structure is derived from a recent linearphase graphic equalizer based on interpolated finite impulse response (IFIR) filters. The proposed system reduces the total latency of the previous equalizer by implementing a hybrid structure. An infinite impulse response (IIR) shelving filter is used in the structure to implement the first band of the equalizer, whereas the rest of the band filters are realized with the linear-phase FIR structure. The introduction of the IIR filter causes a nonlinear phase response in the low frequencies, but the total latency is reduced by 50% in comparison to the linear-phase equalizer. The proposed graphic equalizer is useful in real-time audio processing, where only little latency is tolerated.
Download Dark Velvet Noise
This paper proposes dark velvet noise (DVN) as an extension of the original velvet noise with a lowpass spectrum. The lowpass spectrum is achieved by allowing each pulse in the sparse sequence to have a randomized pulse width. The cutoff frequency is controlled by the density of the sequence. The modulated pulse-width can be implemented efficiently utilizing a discrete set of recursive running-sum filters, one for each unique pulse width. DVN may be used in reverberation algorithms. Typical room reverberation has a frequency-dependent decay, where the high frequencies decay faster than the low ones. A similar effect is achieved by lowering the density and increasing the pulse-width of DVN in time, thereby making the DVN suitable for artificial reverberation.
Download Streamable Neural Audio Synthesis with Non-Causal Convolutions
Deep learning models are mostly used in an offline inference fashion. However, this strongly limits the use of these models inside audio generation setups, as most creative workflows are based on real-time digital signal processing. Although approaches based on recurrent networks can be naturally adapted to this buffer-based computation, the use of convolutions still poses some serious challenges. To tackle this issue, the use of causal streaming convolutions have been proposed. However, this requires specific complexified training and can impact the resulting audio quality. In this paper, we introduce a new method allowing to produce non-causal streaming models. This allows to make any convolutional model compatible with real-time buffer-based processing. As our method is based on a post-training reconfiguration of the model, we show that it is able to transform models trained without causal constraints into streaming models. We apply our method on the recent RAVE model as a case study. This model provides high-quality real-time audio synthesis on a wide range of signals and thus is an ideal candidate to evaluate our method. It should be noted that our method is not restricted to RAVE, and can be straightforwardly applied to any convolutional network. We test our approach on multiple music and speech datasets and show that it is faster than overlap-add methods, while having no impact on the generation quality. Finally, we introduce two open-source implementation of our work as Max/MSP and PureData externals, and as a VST audio plugin. This allows to endow traditional digital audio workstations with real-time neural audio synthesis.
Download Self-Supervised Disentanglement of Harmonic and Rhythmic Features in Music Audio Signals
The aim of latent variable disentanglement is to infer the multiple informative latent representations that lie behind a data generation process and is a key factor in controllable data generation. In this paper, we propose a deep neural network-based self-supervised learning method to infer the disentangled rhythmic and harmonic representations behind music audio generation. We train a variational autoencoder that generates an audio mel-spectrogram from two latent features representing the rhythmic and harmonic content. In the training phase, the variational autoencoder is trained to reconstruct the input mel-spectrogram given its pitch-shifted version. At each forward computation in the training phase, a vector rotation operation is applied to one of the latent features, assuming that the dimensions of the feature vectors are related to pitch intervals. Therefore, in the trained variational autoencoder, the rotated latent feature represents the pitch-related information of the mel-spectrogram, and the unrotated latent feature represents the pitch-invariant information, i.e., the rhythmic content. The proposed method was evaluated using a predictor-based disentanglement metric on the learned features. Furthermore, we demonstrate its application to the automatic generation of music remixes.