On the Challenges of Embedded Real-Time Music Information Retrieval
Real-time applications of Music Information Retrieval (MIR) have been gaining interest recently. However, as deep learning becomes increasingly ubiquitous for music analysis tasks, several challenges and limitations must be overcome to deliver accurate and responsive real-time MIR systems. In addition, modern embedded computers offer great potential for compact systems that use MIR algorithms, such as digital musical instruments. However, embedded computing hardware is generally resource constrained, posing additional limitations. In this paper, we identify and discuss the challenges and limitations of embedded real-time MIR. Furthermore, we discuss potential solutions to these challenges and demonstrate their validity by presenting an embedded real-time classifier of expressive acoustic guitar techniques. The classifier achieved 99.2% accuracy in distinguishing pitched from percussive techniques and 99.1% average accuracy in distinguishing four distinct percussive techniques with a fifth class for pitched sounds. The full classification task is a considerably more complex learning problem, with our preliminary results reaching only 56.5% accuracy. The results were produced with an average latency of 30.7 ms.
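As an illustration of the kind of processing pipeline such a system implies, the sketch below shows a minimal frame-based analysis loop in Python: audio is buffered into short frames, a feature vector is computed per frame, and a pretrained classifier assigns a label. The hop size, the feature choice, and the hypothetical `model.predict()` interface are illustrative assumptions, not the system described in the paper.

```python
# Minimal sketch of a frame-based real-time classification loop, assuming a
# hypothetical pretrained model with a predict() method; frame/hop sizes and
# the feature choice are illustrative, not the paper's exact setup.
import numpy as np

SR = 44100      # sample rate (Hz)
FRAME = 2048    # analysis frame length in samples (~46 ms at 44.1 kHz)
HOP = 512       # hop between frames; smaller hops reduce latency

def spectral_features(frame: np.ndarray) -> np.ndarray:
    """Compute a simple log-magnitude spectrum as the input feature vector."""
    window = np.hanning(len(frame))
    spectrum = np.abs(np.fft.rfft(frame * window))
    return np.log1p(spectrum)

def classify_stream(audio: np.ndarray, model) -> list[str]:
    """Slide over the signal frame by frame and classify each frame."""
    labels = []
    for start in range(0, len(audio) - FRAME + 1, HOP):
        features = spectral_features(audio[start:start + FRAME])
        labels.append(model.predict(features))   # e.g. "pitched" or "percussive"
    return labels
```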
Antialiased Black-Box Modeling of Audio Distortion Circuits Using Real Linear Recurrent Units
In this paper, we propose the use of real-valued Linear Recurrent Units (LRUs) for black-box modeling of audio circuits. A network architecture composed of real LRU blocks interleaved with nonlinear processing stages is proposed. Two case studies are presented: a second-order diode clipper and an overdrive distortion pedal. Furthermore, we show how to integrate the antiderivative antialiasing technique into the proposed method, effectively lowering oversampling requirements. Our experiments show that the proposed method generates models that accurately capture the nonlinear dynamics of the examined devices and are highly efficient, which makes them suitable for real-time operation inside Digital Audio Workstations.
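For readers unfamiliar with the building block, the sketch below shows one real-valued diagonal linear recurrence followed by a static nonlinearity, in the spirit of an LRU block interleaved with a nonlinear stage. The state size, initialization ranges, and the tanh stage are assumptions made for illustration; they do not reproduce the trained architecture or its antialiasing scheme.

```python
# Illustrative sketch of a real-valued linear recurrence followed by a static
# nonlinearity; parameter shapes and initializations are assumptions, not the
# trained model from the paper.
import numpy as np

class RealLRUBlock:
    def __init__(self, state_size: int, rng: np.random.Generator):
        # Real recurrence coefficients kept inside (0, 1) for stability.
        self.a = rng.uniform(0.5, 0.99, state_size)      # diagonal state decay
        self.b = rng.standard_normal(state_size) * 0.1   # input projection
        self.c = rng.standard_normal(state_size) * 0.1   # output projection

    def process(self, x: np.ndarray) -> np.ndarray:
        """Run the diagonal linear recurrence sample by sample, then a tanh stage."""
        h = np.zeros_like(self.a)
        y = np.empty_like(x)
        for n, sample in enumerate(x):
            h = self.a * h + self.b * sample      # linear state update
            y[n] = np.tanh(self.c @ h + sample)   # nonlinear processing stage
        return y

rng = np.random.default_rng(0)
block = RealLRUBlock(state_size=16, rng=rng)
output = block.process(np.sin(2 * np.pi * 110 * np.arange(4410) / 44100))
```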
One-to-Many Conversion for Percussive Samples
A filtering algorithm for generating subtle random variations in sampled sounds is proposed. Using only one recording for impact sound effects or drum machine sounds results in unrealistic repetitiveness during consecutive playback. This paper studies spectral variations in repeated knocking sounds and in three drum sounds: a hihat, a snare, and a tomtom. The proposed method uses a short pseudo-random velvet-noise filter and a low-shelf filter to produce timbral variations targeted at appropriate spectral regions, potentially yielding an endless number of new realistic versions of a single percussive sampled sound. The realism of the resulting processed sounds is studied in a listening test. The results show that the sound quality obtained with the proposed algorithm is at least as good as that of a previous method while using 77% fewer computational operations. The algorithm is widely applicable to computer-generated music and game audio.
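The following sketch illustrates the general idea of filtering a sample with a short, sparse velvet-noise FIR to obtain random timbral variations. The pulse density, filter length, and mixing depth are illustrative choices, and the low-shelf stage of the proposed method is omitted.

```python
# Sketch of one-to-many variation via a short velvet-noise FIR; all constants
# below are illustrative, not the values used in the paper.
import numpy as np
from scipy.signal import fftconvolve

def velvet_noise(length: int, density: int, sr: int, rng: np.random.Generator) -> np.ndarray:
    """Sparse sequence of +/-1 impulses with roughly `density` pulses per second."""
    grid = max(1, sr // density)             # average spacing between pulses (samples)
    vn = np.zeros(length)
    for start in range(0, length, grid):
        pos = start + rng.integers(0, min(grid, length - start))
        vn[pos] = rng.choice([-1.0, 1.0])
    return vn

def vary_sample(sample: np.ndarray, sr: int, seed: int, depth: float = 0.25) -> np.ndarray:
    """Return one random timbral variation of a percussive sample."""
    rng = np.random.default_rng(seed)
    fir = depth * velvet_noise(length=sr // 100, density=2000, sr=sr, rng=rng)
    fir[0] = 1.0                             # pass the direct sound, add sparse variation taps
    return fftconvolve(sample, fir, mode="same")

# Each seed gives a slightly different version of the same recording.
drum_hit = np.random.randn(44100) * np.exp(-np.arange(44100) / 2000.0)   # toy percussive sample
variations = [vary_sample(drum_hit, sr=44100, seed=s) for s in range(3)]
```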
Diet Deep Generative Audio Models With Structured Lottery
Deep learning models have provided extremely successful solutions in most audio application fields. However, the high accuracy of these models comes at the expense of a tremendous computation cost. This aspect is almost always overlooked in evaluating the quality of proposed models, yet models should not be evaluated without taking their complexity into account. This aspect is especially critical in audio applications, which heavily rely on specialized embedded hardware with real-time constraints. In this paper, we build on recent observations that deep models are highly overparameterized by studying the lottery ticket hypothesis on deep generative audio models. This hypothesis states that extremely efficient small sub-networks exist in deep models and would provide higher accuracy than larger models if trained in isolation. However, lottery tickets are found by relying on unstructured masking, which means that the resulting models do not provide any gain in either disk size or inference time. Instead, we develop here a method aimed at performing structured trimming. We show that this requires relying on global selection and introduce a specific criterion based on mutual information. First, we confirm the surprising result that smaller models provide higher accuracy than their large counterparts. We further show that we can remove up to 95% of the model weights without significant degradation in accuracy. Hence, we can obtain very light models for generative audio across popular methods such as WaveNet, SING, or DDSP that are up to 100 times smaller with commensurate accuracy. We study the theoretical bounds for embedding these models on Raspberry Pi and Arduino, and show that we can obtain generative models on CPU with quality equivalent to that of large GPU models. Finally, we discuss the possibility of implementing deep generative audio models on embedded platforms.
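A minimal sketch of global structured trimming is given below. For brevity it ranks whole output channels with an L1-norm saliency as a stand-in for the paper's mutual-information criterion; in a real deployment the unselected channels would be physically removed from the weight matrices (and from the matching input columns of the following layer) to realize the disk-size and inference-time gains.

```python
# Sketch of global structured trimming: whole channels are ranked across all
# layers and the lowest-ranked ones are zeroed out. L1-norm saliency is used
# here as a stand-in for the paper's mutual-information criterion; layer sizes
# and the 95% trim ratio are illustrative.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 512), nn.ReLU(),
                      nn.Linear(512, 512), nn.ReLU(),
                      nn.Linear(512, 128))

# Collect one saliency score per output channel, globally across layers.
scores = []
for layer_idx, layer in enumerate(model):
    if isinstance(layer, nn.Linear):
        saliency = layer.weight.abs().sum(dim=1)          # one score per output unit
        scores += [(s.item(), layer_idx, ch) for ch, s in enumerate(saliency)]

# Keep only the top 5% of channels (i.e. trim 95%), selected globally.
scores.sort(reverse=True)
keep = {(layer_idx, ch) for _, layer_idx, ch in scores[: int(0.05 * len(scores))]}

with torch.no_grad():
    for layer_idx, layer in enumerate(model):
        if isinstance(layer, nn.Linear):
            for ch in range(layer.out_features):
                if (layer_idx, ch) not in keep:
                    # Zeroed here for brevity; removing the rows outright is what
                    # yields the actual size and speed gains.
                    layer.weight[ch].zero_()
                    layer.bias[ch].zero_()
```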
Amp-Space: A Large-Scale Dataset for Fine-Grained Timbre Transformation
We release Amp-Space, a large-scale dataset of paired audio samples: a source audio signal and an output signal, the result of a timbre transformation. The types of transformations we study come from black-box musical tools (amplifiers, stompboxes, studio effects) traditionally used to shape guitar, bass, or synthesizer sounds. For each sample of transformed audio, the set of parameters used to create it is given. Samples are from both real and simulated devices, the latter allowing for orders of magnitude more data than found in comparable datasets. We demonstrate potential use cases of this data by (a) pre-training a conditional WaveNet model on synthetic data, showing that it reduces the number of samples necessary to digitally reproduce a real musical device, and (b) training a variational autoencoder to shape a continuous space of timbre transformations for creating new sounds through interpolation.
A New Paradigm for Sound Design
A sound scene can be defined as any “environmental” sound that has a consistent background texture, with one or more potentially recurring foreground events. We describe a data-driven framework for analyzing, transforming, and synthesizing high-quality sound scenes, with flexible control over the components of the synthesized sound. Given one or more sound scenes, we provide well-defined means to: (1) identify points of interest in the sound and extract them into reusable templates, (2) transform sound components independently of the background or other events, (3) continually re-synthesize the background texture in a perceptually convincing manner, and (4) controllably place event templates over the background, varying key parameters such as density, periodicity, relative loudness, and spatial positioning. Contributions include techniques and paradigms for template selection and extraction, independent sound transformation and flexible re-synthesis; extensions to a wavelet-based background analysis/synthesis; and user interfaces to facilitate the various phases. Given this framework, it is possible to completely transform an existing sound scene, dynamically generate sound scenes of unlimited length, and construct new sound scenes by combining elements from different sound scenes. URL: http://taps.cs.princeton.edu/
Leveraging Electric Guitar Tones and Effects to Improve Robustness in Guitar Tablature Transcription Modeling
Guitar tablature transcription (GTT) aims at automatically generating symbolic representations from real solo guitar performances. Due to its applications in education and musicology, GTT has gained traction in recent years. However, GTT robustness has been limited due to the small size of available datasets. Researchers have recently used synthetic data that simulates guitar performances using pre-recorded or computer-generated tones, allowing for scalable and automatic data generation. The present study complements these efforts by demonstrating that GTT robustness can be improved by including synthetic training data created using recordings of real guitar tones played with different audio effects. We evaluate our approach on a new evaluation dataset with professional solo guitar performances that we composed and collected, featuring a wide array of tones, chords, and scales.
Partiels – Exploring, Analyzing and Understanding Sounds
This article presents Partiels, an open-source application developed at IRCAM to analyze digital audio files and explore sound characteristics. The application uses Vamp plug-ins to extract various kinds of information about different aspects of the sound, such as spectrum, partials, pitch, tempo, text, and chords. Partiels is the successor to AudioSculpt, offering a modern, flexible interface for visualizing, editing, and exporting analysis results, and addressing a wide range of issues from musicological practice to sound creation and signal processing research. The article describes Partiels’ key features, including analysis organization, audio file management, results visualization and editing, as well as data export and sharing options, and its interoperability with other software such as Max and Pure Data. In addition, it highlights the numerous analysis plug-ins developed at IRCAM, based in particular on machine learning models, as well as the IRCAM Vamp extension, which overcomes certain limitations of the original Vamp format.
Inference-Time Structured Pruning for Real-Time Neural Network Audio Effects
Structured pruning is a technique for reducing the computational load and memory footprint of neural networks by removing structured subsets of parameters according to a predefined schedule or ranking criterion. This paper investigates the application of structured pruning to real-time neural network audio effects, focusing on both feedforward networks and recurrent architectures. We evaluate multiple pruning strategies at inference time, without retraining, and analyze their effects on model performance. To quantify the trade-off between parameter count and audio fidelity, we construct a theoretical model of the approximation error as a function of network architecture and pruning level. The resulting bounds establish a principled relationship between pruning-induced sparsity and functional error, enabling informed deployment of neural audio effects in constrained real-time environments.
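The sketch below shows one way inference-time structured pruning can look for a small feedforward effect model: hidden units are ranked by weight magnitude and the weight matrices are sliced so that the pruned model is genuinely smaller, with no retraining. The layer sizes, saliency measure, and 50% pruning level are assumptions for illustration, not the strategies evaluated in the paper.

```python
# Sketch of inference-time structured pruning: hidden units are ranked and
# physically removed by slicing the weight matrices, so the pruned model is
# smaller and faster without retraining. Sizes and ratios are illustrative.
import torch
import torch.nn as nn

hidden = 64
net = nn.Sequential(nn.Linear(1, hidden), nn.Tanh(), nn.Linear(hidden, 1))

def prune_hidden_units(net: nn.Sequential, keep_ratio: float) -> nn.Sequential:
    first, act, last = net[0], net[1], net[2]
    # Rank hidden units by the norms of their incoming and outgoing weights.
    saliency = first.weight.norm(dim=1) * last.weight.norm(dim=0)
    n_keep = max(1, int(keep_ratio * len(saliency)))
    keep = torch.topk(saliency, n_keep).indices.sort().values

    pruned_first = nn.Linear(first.in_features, n_keep)
    pruned_last = nn.Linear(n_keep, last.out_features)
    with torch.no_grad():
        pruned_first.weight.copy_(first.weight[keep])
        pruned_first.bias.copy_(first.bias[keep])
        pruned_last.weight.copy_(last.weight[:, keep])
        pruned_last.bias.copy_(last.bias)
    return nn.Sequential(pruned_first, act, pruned_last)

small_net = prune_hidden_units(net, keep_ratio=0.5)   # half the hidden units remain
```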
Audio Processor Parameters: Estimating Distributions Instead of Deterministic Values
Audio effects and sound synthesizers are widely used processors in popular music. Their parameters control the quality of the output sound. Multiple combinations of parameters can lead to the same sound. While recent approaches have been proposed to estimate these parameters given only the output sound, they are deterministic, i.e., they estimate only a single solution among the many possible parameter configurations. In this work, we propose to model the parameters as probability distributions instead of deterministic values. To learn the distributions, we optimize two objectives: (1) we minimize the reconstruction error between the ground-truth output sound and the one generated using the estimated parameters, as is usually done, but also (2) we maximize the parameter diversity, using entropy. We evaluate our approach through two numerical audio experiments to show its effectiveness. These results show how our approach effectively outputs multiple combinations of parameters to match one sound.
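A minimal sketch of the two-objective training signal is shown below: a reconstruction term computed through a differentiable processor plus an entropy bonus that discourages the predicted parameter distribution from collapsing to a single configuration. The categorical parameterization, the toy `render` stand-in, and the entropy weight are assumptions made for illustration.

```python
# Sketch of the two-objective loss: reconstruction error (objective 1) minus a
# weighted entropy term (objective 2). The categorical parameterization, the
# toy render() processor, and the weight are illustrative assumptions.
import torch
import torch.nn.functional as F

def render(params: torch.Tensor) -> torch.Tensor:
    """Hypothetical differentiable audio processor: parameters -> output sound."""
    t = torch.linspace(0, 1, 1024)
    return torch.sin(2 * torch.pi * 220 * t) * params.mean()

def loss_fn(logits: torch.Tensor, target_audio: torch.Tensor, entropy_weight: float = 0.1):
    # logits: (n_params, n_bins) scores over quantized parameter values.
    probs = logits.softmax(dim=-1)
    bins = torch.linspace(0, 1, logits.shape[-1])
    expected_params = (probs * bins).sum(dim=-1)        # soft, differentiable estimate

    reconstruction = F.mse_loss(render(expected_params), target_audio)   # objective (1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1).mean()  # objective (2)
    return reconstruction - entropy_weight * entropy    # minimize error, maximize entropy

logits = torch.zeros(4, 16, requires_grad=True)          # 4 parameters, 16 bins each
target = render(torch.tensor([0.7, 0.2, 0.5, 0.9]))
loss_fn(logits, target).backward()
```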