Interpretable timbre synthesis using variational autoencoders regularized on timbre descriptors

Controllable timbre synthesis has been a subject of research for several decades, and deep neural networks have been the most successful approach in this area. Deep generative models such as Variational Autoencoders (VAEs) can generate a high-level representation of audio while providing a structured latent space. Despite these advantages, the interpretability of such latent spaces in terms of human perception is often limited. To address this limitation and enhance control over timbre generation, we propose a VAE whose latent space is regularized on timbre descriptors. Moreover, we suggest a more concise representation of sound based on its harmonic content, in order to minimize the dimensionality of the latent space.
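The abstract does not spell out the regularizer, so the following is only a plausible sketch: an attribute-regularization penalty in the style of AR-VAE that ties one latent dimension to a timbre descriptor (e.g. spectral centroid) by matching the ordering of the two quantities across a batch. The function name and the choice of penalty are assumptions, not the paper's method.

    import torch

    def descriptor_regularizer(z, attr, dim=0, gamma=1.0):
        """Encourage latent dimension `dim` to vary monotonically with a
        timbre descriptor `attr` (e.g. spectral centroid) over a batch."""
        dz = z[:, dim].unsqueeze(0) - z[:, dim].unsqueeze(1)   # pairwise latent diffs
        da = attr.unsqueeze(0) - attr.unsqueeze(1)             # pairwise descriptor diffs
        # Penalize disagreement between the two orderings (AR-VAE-style loss)
        return gamma * torch.mean(torch.abs(torch.tanh(dz) - torch.sign(da)))

    # loss = reconstruction + beta * kl_divergence + descriptor_regularizer(z, centroid)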
Vocal Tract Area Estimation by Gradient Descent

Articulatory features can provide interpretable and flexible controls for the synthesis of human vocalizations by allowing the user to directly modify parameters like vocal strain or lip position. To make this manipulation through resynthesis possible, we need to estimate the features that result in a desired vocalization directly from audio recordings. In this work, we propose a white-box optimization technique for estimating glottal source parameters and vocal tract shapes from audio recordings of human vowels. The approach is based on inverse filtering and optimizing the frequency response of a waveguide model of the vocal tract with gradient descent, propagating error gradients through the mapping of articulatory features to the vocal tract area function. We apply this method to the task of matching the sound of the Pink Trombone, an interactive articulatory synthesizer, to a given vocalization. We find that our method accurately recovers control functions for audio generated by the Pink Trombone itself. We then compare our technique against evolutionary optimization algorithms and a neural network trained to predict control parameters from audio. A subjective evaluation finds that our approach outperforms these black-box optimization baselines on the task of reproducing human vocalizations.
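As an illustration of the autograd pattern involved, the toy sketch below backpropagates through a hand-made articulatory map (a tongue-controlled constriction, loosely inspired by the Pink Trombone's controls) into Kelly-Lochbaum reflection coefficients. The mapping, the loss, and all values are stand-ins; the paper matches the frequency response of a full waveguide model rather than reflection coefficients directly.

    import torch

    def area_function(tongue_pos, tongue_diam, n=44):
        # Smooth constriction on a uniform tube (rest area 1.5 cm^2); a stand-in
        # for the real articulatory-feature-to-area mapping
        x = torch.linspace(0.0, 1.0, n)
        bump = torch.exp(-((x - tongue_pos) ** 2) / 0.01)
        return 1.5 - (1.5 - tongue_diam) * bump

    def reflections(areas):
        # Kelly-Lochbaum reflection coefficients between adjacent tube sections
        return (areas[:-1] - areas[1:]) / (areas[:-1] + areas[1:])

    params = torch.tensor([0.5, 0.5], requires_grad=True)      # tongue pos, diam
    target = reflections(area_function(torch.tensor(0.7), torch.tensor(0.2)))
    opt = torch.optim.Adam([params], lr=2e-2)
    for _ in range(300):
        opt.zero_grad()
        k = reflections(area_function(params[0], params[1]))
        loss = torch.mean((k - target) ** 2)   # real method: frequency-response loss
        loss.backward()                        # gradients flow through the articulatory map
        opt.step()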
Vocal Timbre Effects with Differentiable Digital Signal Processing

We explore two approaches to creatively altering vocal timbre using Differentiable Digital Signal Processing (DDSP). The first approach is inspired by classic cross-synthesis techniques. A pretrained DDSP decoder predicts a filter for a noise source and a harmonic distribution, based on pitch and loudness information extracted from the vocal input. Before synthesis, the harmonic distribution is modified by interpolating between the predicted distribution and the harmonics of the input. We provide a real-time implementation of this approach in the form of a Neutone model. In the second approach, autoencoder models are trained on datasets consisting of both vocal and instrument training data. To apply the effect, the trained autoencoder attempts to reconstruct the vocal input. We find that there is a desirable “sweet spot” during training, where the model has learned to reconstruct the phonetic content of the input vocals, but is still affected by the timbre of the instrument mixed into the training data. After further training, that effect disappears. A perceptual evaluation compares the two approaches. We find that the autoencoder in the second approach is able to reconstruct intelligible lyrical content without any explicit phonetic information provided during training.
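The interpolation step in the first approach can be stated compactly. A minimal sketch, assuming both the decoder-predicted harmonic distribution and the input harmonics are nonnegative vectors of equal length; the linear blend and renormalization are one reading of "interpolating" here, not a confirmed detail.

    import numpy as np

    def blend_harmonics(predicted, measured, alpha):
        """Blend the DDSP-predicted harmonic distribution with the harmonics
        measured from the input voice. alpha = 0 keeps the prediction,
        alpha = 1 uses the input harmonics unchanged."""
        mix = (1.0 - alpha) * predicted + alpha * measured
        return mix / np.sum(mix)   # renormalize to a distribution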
Explicit Vector Wave Digital Filter Modeling of Circuits with a Single Bipolar Junction Transistor

The recently developed extension of Wave Digital Filters based on vector wave variables has broadened the class of circuits with linear two-port elements that can be modeled in a modular and explicit fashion in the Wave Digital (WD) domain. In this paper, we apply the vector definition of wave variables to nonlinear two-port elements. In particular, we present two vector WD models of a Bipolar Junction Transistor (BJT) using characteristic equations derived from an extended Ebers-Moll model. One, implicit, is based on a modified Newton-Raphson method; the other, explicit, is based on a neural network trained in the WD domain, and it is shown to allow fully explicit implementation of circuits with a single BJT, which can be executed in real time.
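For orientation only, here is what an implicit Newton-Raphson solve of the Ebers-Moll equations looks like in the voltage domain, with each junction terminated by a Thevenin equivalent; the paper's models operate on (vector) wave variables instead, and all constants below are illustrative.

    import numpy as np

    IS_E, IS_C = 1e-14, 1e-14      # saturation currents (illustrative)
    A_F, A_R = 0.99, 0.5           # forward/reverse current gains
    VT = 25.85e-3                  # thermal voltage

    def ebers_moll(v):
        """Emitter and collector currents for junction voltages v = [vbe, vbc]."""
        vbe, vbc = v
        ie = -IS_E * (np.exp(vbe / VT) - 1) + A_R * IS_C * (np.exp(vbc / VT) - 1)
        ic = A_F * IS_E * (np.exp(vbe / VT) - 1) - IS_C * (np.exp(vbc / VT) - 1)
        return np.array([ie, ic])

    def solve_junctions(Vs, Rs, tol=1e-12, h=1e-8):
        """Newton-Raphson: find v so the BJT currents match the port
        currents (Vs - v) / Rs. Damping/step control omitted for brevity."""
        v = np.zeros(2)
        for _ in range(50):
            F = ebers_moll(v) - (Vs - v) / Rs
            J = np.zeros((2, 2))
            for k in range(2):                 # forward-difference Jacobian
                dv = np.zeros(2); dv[k] = h
                J[:, k] = (ebers_moll(v + dv) - (Vs - v - dv) / Rs - F) / h
            step = np.linalg.solve(J, F)
            v -= step
            if np.max(np.abs(step)) < tol:
                break
        return v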
P-RAVE: Improving RAVE through pitch conditioning and more with application to singing voice conversion

In this paper, we introduce means of improving the fidelity and controllability of the RAVE generative audio model by factorizing pitch and other features. We accomplish this primarily by creating a multi-band excitation signal capturing pitch and/or loudness information, and by using it to FiLM-condition the RAVE generator. To further improve fidelity in the singing voice application explored here, we also consider concatenating a supervised phonetic encoding to the latent representation. An ablation analysis highlights the improved performance of our incremental improvements relative to the baseline RAVE model. As our primary enhancement involves adding a stable pitch conditioning mechanism to the RAVE model, we simply call our method P-RAVE.
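FiLM conditioning itself is a standard mechanism: the conditioning signal produces a per-channel scale and shift for the generator's activations. A minimal sketch follows; the tensor shapes and the assumption that the multi-band excitation is time-aligned with the generator features are mine, not details from the paper.

    import torch
    import torch.nn as nn

    class FiLM(nn.Module):
        """Feature-wise Linear Modulation of generator activations by a
        time-aligned conditioning signal (e.g. a multi-band excitation)."""
        def __init__(self, cond_channels, feat_channels):
            super().__init__()
            self.proj = nn.Conv1d(cond_channels, 2 * feat_channels, kernel_size=1)

        def forward(self, x, cond):
            # x: (batch, feat_channels, time); cond: (batch, cond_channels, time)
            gamma, beta = self.proj(cond).chunk(2, dim=1)
            return gamma * x + beta                    # per-channel scale and shift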
What you hear is what you see: Audio quality from Image Quality Metrics

In this study, we investigate the feasibility of utilizing state-of-the-art perceptual image metrics for evaluating audio signals by representing them as spectrograms. The encouraging outcome of the proposed approach is based on the similarity between the neural mechanisms in the auditory and visual pathways. Furthermore, we customise one of the metrics, which has a psychoacoustically plausible architecture, to account for the peculiarities of sound signals. We evaluate the effectiveness of our proposed metric and several baseline metrics using a music dataset, with promising results in terms of the correlation between the metrics and the perceived quality of audio as rated by human evaluators.
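The basic pipeline can be sketched with SSIM standing in for the perceptual image metrics the study evaluates; the STFT settings and normalization below are arbitrary choices, not the paper's.

    import numpy as np
    import librosa
    from skimage.metrics import structural_similarity

    def audio_quality_score(ref, test, sr=44100):
        """Compare a degraded signal to a reference by applying an image
        metric (here SSIM) to normalized log-magnitude spectrograms."""
        def spec_image(x):
            S = np.abs(librosa.stft(x, n_fft=1024, hop_length=256))
            S_db = librosa.amplitude_to_db(S, ref=np.max)
            return (S_db - S_db.min()) / (S_db.max() - S_db.min() + 1e-9)
        return structural_similarity(spec_image(ref), spec_image(test),
                                     data_range=1.0)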
Real-Time Singing Voice Conversion Plug-In

In this paper, we propose an approach to real-time singing voice conversion and outline its development as a plug-in suitable for streaming use in a digital audio workstation. In order to simultaneously ensure pitch preservation and reduce the computational complexity of the overall system, we adopt a source-filter methodology and consider a vocoder-free paradigm for modeling the conversion task. In this case, the source is extracted and altered using more traditional DSP techniques, while the filter is determined using a deep neural network. The latter can be trained in an end-to-end fashion and additionally uses adversarial training to improve system fidelity. Careful design allows the system to scale naturally to sampling rates higher than the neural filter model sampling rate, outputting full-band signals while avoiding the need for resampling. Accordingly, the resulting system, when operating at 44.1 kHz, incurs under 60 ms of latency and operates 20 times faster than real-time on a standard laptop CPU.
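The "traditional DSP" half of such a source-filter split is often LPC inverse filtering, sketched below; whether the plug-in uses LPC specifically is an assumption, and the neural network that would supply the target filter at resynthesis is omitted.

    import numpy as np
    import scipy.signal as sig
    import librosa

    def extract_source(x, order=24):
        """Inverse-filter a voice signal with LPC to obtain the excitation
        (source) residual; the all-pole envelope models the vocal tract."""
        a = librosa.lpc(x, order=order)        # [1, a1, ..., ap]
        source = sig.lfilter(a, [1.0], x)      # A(z) applied as FIR -> residual
        return source, a

    # Resynthesis through a (here, unmodified) filter:
    #   y = sig.lfilter([1.0], a, source)
    # In a conversion system, a network would predict the target filter instead.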
Efficient simulation of the yaybahar using a modal approach

This work presents a physical model of the yaybahar, a recently invented acoustic instrument. Here, output from a bowed string is passed through a long spring, before being amplified and propagated in air via a membrane. The highly dispersive character of the spring is responsible for the typical synthetic tonal quality of this instrument. Building on previous literature, this work presents a modal discretisation of the full system, with fine control over frequency-dependent decay times, modal amplitudes and frequencies, all essential for an accurate simulation of the dispersive characteristics of reverberation. The string-bow-bridge system is also solved in the modal domain, using recently developed non-iterative numerical methods allowing for efficient simulation.
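At its core, a modal discretisation renders the output as a sum of decaying sinusoids with independently controlled frequency, amplitude, and decay time per mode. A minimal sketch, with all parameter values illustrative:

    import numpy as np

    def modal_synth(freqs, t60s, amps, dur, sr=44100):
        """Sum of exponentially decaying sinusoidal modes; each mode has its
        own frequency, amplitude, and T60 (time to decay by 60 dB)."""
        t = np.arange(int(dur * sr)) / sr
        out = np.zeros_like(t)
        for f, t60, a in zip(freqs, t60s, amps):
            sigma = np.log(1000.0) / t60       # exp(-sigma * t60) = 1e-3 (-60 dB)
            out += a * np.exp(-sigma * t) * np.sin(2 * np.pi * f * t)
        return out

    # e.g. a dispersive, spring-like cluster of inharmonic partials:
    # y = modal_synth([200, 413, 655, 930], [4.0, 3.2, 2.5, 1.9], [1, .6, .4, .3], 5.0)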
Antialiasing Piecewise Polynomial Waveshapers

Memoryless waveshapers are commonly used in audio signal processing. In discrete time, they suffer from well-known aliasing artifacts. We present a method for applying antiderivative antialiasing (ADAA), which mitigates aliasing, to any waveshaping function that can be represented as a piecewise polynomial. Specifically, we treat the special case of a piecewise linear waveshaper. Furthermore, we introduce a method for replacing the sharp corners and jump discontinuities in any piecewise linear waveshaper with smoothed polynomial approximations, whose derivatives match the adjacent line segments up to a specified order. This piecewise polynomial can again be antialiased as a special case of the general piecewise polynomial. Especially when combined with light oversampling, these techniques are effective at reducing aliasing, and the proposed method for rounding corners in piecewise linear waveshapers can also create more “realistic” analog-style waveshapers than standard piecewise linear functions.
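First-order ADAA applied to the hard clipper, the canonical piecewise linear waveshaper, illustrates the general recipe: evaluate the antiderivative at successive samples and divide by the input difference, falling back to direct evaluation when consecutive samples nearly coincide. The fallback threshold below is a common practical choice, not a value from the paper.

    import numpy as np

    def hard_clip_AD1(x):
        """First antiderivative of hard clipping (itself piecewise polynomial)."""
        return np.where(np.abs(x) <= 1.0, 0.5 * x * x, np.abs(x) - 0.5)

    def adaa_hard_clip(x, eps=1e-6):
        """y[n] = (F(x[n]) - F(x[n-1])) / (x[n] - x[n-1]), with the
        ill-conditioned case handled by clipping the midpoint directly."""
        x1 = np.concatenate(([x[0]], x[:-1]))          # x[n-1]
        dx = x - x1
        direct = np.clip(0.5 * (x + x1), -1.0, 1.0)    # fallback near dx = 0
        with np.errstate(divide="ignore", invalid="ignore"):
            y = (hard_clip_AD1(x) - hard_clip_AD1(x1)) / dx
        return np.where(np.abs(dx) < eps, direct, y)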
How Smooth Do You Think I Am: An Analysis on the Frequency-Dependent Temporal Roughness of Velvet Noise

Velvet noise is a sparse pseudo-random signal, with applications in late reverberation modeling, decorrelation, speech generation, and extending signals. The temporal roughness of broadband velvet noise has been studied earlier; however, the frequency dependence of this roughness has received little attention. This paper explores which qualities, such as pulse density, filter type, and filter shape, contribute to frequency-dependent temporal roughness. An adaptive perceptual test was conducted to find the minimal densities at which the noise sounds smooth at octave bands as well as at corresponding lowpass bands. The results showed that the cutoff frequency of a lowpass filter, as well as the center frequency of an octave filter, is correlated with the minimal density perceived as smooth. With the lowest lowpass cutoff frequency tested, 125 Hz, the filtered velvet noise sounded smooth at an average of 725 pulses/s, while octave-filtered noise with a 125-Hz center frequency sounded smooth at an average of 401 pulses/s. For broadband velvet noise, the minimal density for smoothness was found to be 1554 pulses/s on average. The results of this paper are applicable in designing velvet-noise-based artificial reverberation with minimal pulse density.
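For reference, generating (original) velvet noise at a target density is straightforward: place one random-polarity unit impulse at a random offset inside each grid period. The sequence can then be filtered at the band of interest; the construction below follows the standard OVN definition, not anything specific to this paper.

    import numpy as np

    def velvet_noise(density, dur, sr=44100, rng=None):
        """Velvet noise: one +/-1 impulse at a random offset within each grid
        period of sr/density samples (density in pulses per second)."""
        rng = np.random.default_rng() if rng is None else rng
        n = int(dur * sr)
        Td = sr / density                               # grid period (samples)
        m = np.arange(int(n / Td))                      # grid indices
        pos = (m * Td + rng.uniform(0, Td, size=m.size)).astype(int)
        pos = pos[pos < n]
        out = np.zeros(n)
        out[pos] = rng.choice([-1.0, 1.0], size=pos.size)
        return out

    # e.g. a sequence near the reported broadband smoothness threshold:
    # x = velvet_noise(1554, 1.0)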