A Statistics-Driven Differentiable Approach for Sound Texture Synthesis and Analysis
In this work, we introduce TexStat, a novel loss function specifically designed for the analysis and synthesis of texture sounds
characterized by stochastic structure and perceptual stationarity.
Drawing inspiration from the statistical and perceptual framework
of McDermott and Simoncelli, TexStat identifies similarities
between signals belonging to the same texture category without
relying on temporal structure. We also propose using TexStat
as a validation metric alongside the Fréchet Audio Distance (FAD) to
evaluate texture sound synthesis models. In addition to TexStat,
we present TexEnv, an efficient, lightweight and differentiable
texture sound synthesizer that generates audio by imposing amplitude envelopes on filtered noise. We further integrate these components into TexDSP, a DDSP-inspired generative model tailored
for texture sounds. Through extensive experiments across various
texture sound types, we demonstrate that TexStat is perceptually meaningful, time-invariant, and robust to noise, features that
make it effective both as a loss function for generative tasks and as
a validation metric. All tools and code are provided as open-source
contributions, and our PyTorch implementations are efficient, differentiable, and highly configurable, enabling their use both in generative tasks and in perceptually grounded evaluation.
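
To make the idea of "imposing amplitude envelopes on filtered noise" concrete, the following is a minimal sketch in plain Python. It is not the TexEnv implementation: the two-band split with a one-pole low-pass filter, the linear envelope interpolation, and all function and parameter names (`one_pole_lowpass`, `interpolate_envelope`, `synthesize`, `alpha`) are assumptions chosen only for illustration.

```python
import random


def one_pole_lowpass(x, alpha):
    """One-pole low-pass filter: y[n] = alpha * x[n] + (1 - alpha) * y[n-1]."""
    y, prev = [], 0.0
    for sample in x:
        prev = alpha * sample + (1.0 - alpha) * prev
        y.append(prev)
    return y


def interpolate_envelope(points, n):
    """Linearly interpolate a few control points up to n samples (n >= 2)."""
    if len(points) < 2 or n < 2:
        raise ValueError("need at least two control points and two samples")
    out = []
    for i in range(n):
        pos = i * (len(points) - 1) / (n - 1)
        k = min(int(pos), len(points) - 2)
        frac = pos - k
        out.append((1.0 - frac) * points[k] + frac * points[k + 1])
    return out


def synthesize(envelopes, n, alpha=0.1, seed=0):
    """Sum two noise bands, each scaled by its own amplitude envelope.

    envelopes: two lists of control points (low band, high band).
    The band split is a hypothetical stand-in for a real filterbank.
    """
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
    low = one_pole_lowpass(noise, alpha)
    high = [x - l for x, l in zip(noise, low)]  # complementary high band
    out = [0.0] * n
    for band, points in zip((low, high), envelopes):
        env = interpolate_envelope(points, n)
        for i in range(n):
            out[i] += env[i] * band[i]
    return out


# Low band swells and fades while the high band does the opposite.
audio = synthesize([[0.0, 1.0, 0.0], [1.0, 0.0, 1.0]], n=1000)
```

In this sketch the envelope control points are the only parameters, which is what keeps such a synthesizer lightweight; a differentiable version would express the same operations as tensor operations so that gradients can flow back to the envelopes.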