A Statistics-Driven Differentiable Approach for Sound Texture Synthesis and Analysis
In this work, we introduce TexStat, a novel loss function specifically designed for the analysis and synthesis of texture sounds characterized by stochastic structure and perceptual stationarity. Drawing inspiration from the statistical and perceptual framework of McDermott and Simoncelli, TexStat captures similarities between signals belonging to the same texture category without relying on temporal structure. We also propose using TexStat as a validation metric alongside the Fréchet Audio Distance (FAD) to evaluate texture sound synthesis models. In addition to TexStat, we present TexEnv, an efficient, lightweight, and differentiable texture sound synthesizer that generates audio by imposing amplitude envelopes on filtered noise. We further integrate these components into TexDSP, a DDSP-inspired generative model tailored for texture sounds. Through extensive experiments across various texture sound types, we demonstrate that TexStat is perceptually meaningful, time-invariant, and robust to noise, features that make it effective both as a loss function for generative tasks and as a validation metric. All tools and code are provided as open-source contributions, and our PyTorch implementations are efficient, differentiable, and highly configurable, enabling their use both in generative tasks and as a perceptually grounded evaluation metric.
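To make the synthesis scheme concrete, the following is a minimal PyTorch sketch of the general idea the abstract describes for TexEnv: amplitude envelopes imposed on band-pass-filtered noise, with every step differentiable with respect to the envelopes. The function name, shapes, and parameters are illustrative assumptions, not the actual TexEnv API.

```python
import numpy as np
import torch
import torch.nn.functional as F

def envelope_noise_synthesis(envelopes, band_filters, hop=64):
    """Sketch of envelope-on-filtered-noise synthesis (not the TexEnv API).

    envelopes:    (n_bands, n_frames) non-negative frame-rate envelopes
    band_filters: (n_bands, fir_len)  FIR band-pass filters, odd fir_len
    """
    n_bands, n_frames = envelopes.shape
    fir_len = band_filters.shape[1]
    n_samples = n_frames * hop

    # White noise, split into sub-bands with a depthwise 1-D convolution.
    noise = torch.randn(1, n_bands, n_samples)
    bands = F.conv1d(noise, band_filters.unsqueeze(1),
                     padding=fir_len // 2, groups=n_bands)[0, :, :n_samples]

    # Upsample frame-rate envelopes to sample rate; linear interpolation
    # keeps the pipeline differentiable w.r.t. the envelopes.
    env = F.interpolate(envelopes.unsqueeze(0), size=n_samples,
                        mode="linear", align_corners=False)[0]

    return (bands * env).sum(dim=0)  # mix the shaped bands to mono

# Example: three bands at 16 kHz shaped by learnable envelopes.
from scipy.signal import firwin
fir = [firwin(129, [lo, hi], pass_zero=False, fs=16000)
       for lo, hi in [(100, 500), (500, 2000), (2000, 6000)]]
band_filters = torch.tensor(np.stack(fir), dtype=torch.float32)
envelopes = torch.rand(3, 250).requires_grad_()   # 250 frames -> 1 s of audio
audio = envelope_noise_synthesis(envelopes, band_filters, hop=64)
audio.sum().backward()  # gradients flow back to the envelopes
```

Because the only free parameters are the per-band envelopes, a loss in the spirit of TexStat (matching summary statistics rather than waveforms) can drive this synthesizer by gradient descent.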
An Efficient Algorithm for Real-Time Spectrogram Inversion
We present a computationally efficient real-time algorithm for constructing audio signals from spectrograms. Spectrograms consist of a time sequence of short-time Fourier transform magnitude (STFTM) spectra. During the audio signal construction process, phases are derived for the individual frequency components so that the spectrogram of the constructed signal is as close as possible to the target spectrogram given real-time constraints. The algorithm is a variation of the classic Griffin and Lim [1] technique modified to be computable in real-time. We discuss the application of the algorithm to time-scale modification of audio signals such as speech and music, and compare its performance with that of other methods. The new algorithm generates comparable or better results with significantly less computation. The phase consistency between adjacent frames produces excellent subjective sound quality with minimal frame transition artifacts.
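For reference, here is a minimal sketch of the classic offline Griffin-Lim iteration that the real-time algorithm builds on: alternate between resynthesis and re-analysis, keeping the target magnitudes and updating only the phases. It uses scipy.signal; the function name and parameters are our own, and the real-time modifications described above (committing phases frame by frame under a latency budget instead of iterating over the whole spectrogram) are not shown.

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(target_mag, n_fft=1024, hop=256, n_iter=32, seed=0):
    """Offline Griffin-Lim phase reconstruction (sketch).

    target_mag: (1 + n_fft//2, n_frames) STFT magnitudes, assumed to be
    produced by scipy.signal.stft with the same n_fft/hop settings.
    """
    noverlap = n_fft - hop
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(target_mag.shape))
    for _ in range(n_iter):
        # Resynthesize with the current phase estimate...
        _, x = istft(target_mag * phase, nperseg=n_fft, noverlap=noverlap)
        # ...then re-analyze and keep only the phases of the result.
        _, _, spec = stft(x, nperseg=n_fft, noverlap=noverlap)
        # Guard against off-by-a-frame drift in the round trip.
        n = target_mag.shape[1]
        if spec.shape[1] < n:
            spec = np.pad(spec, ((0, 0), (0, n - spec.shape[1])))
        phase = np.exp(1j * np.angle(spec[:, :n]))
    _, x = istft(target_mag * phase, nperseg=n_fft, noverlap=noverlap)
    return x
```

Each iteration requires a full inverse and forward STFT over the entire spectrogram, which is what makes the offline form unsuitable for streaming and motivates the real-time variant.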