Combining classifications based on local and global features: application to singer identification

In this paper we investigate the problem of singer identification on a cappella recordings of isolated notes. Most studies on singer identification describe the content of singing-voice signals with features related to timbre (such as MFCCs or LPC coefficients). These features describe the behavior of frequencies at a given instant of time (local features). In this paper, we propose to describe a sung note through the temporal variations of its fundamental frequency (and its harmonics). The periodic and continuous variations of the frequency trajectories are analyzed over the whole note, and the resulting features reflect expressive and intonative elements of singing such as vibrato, tremolo and portamento. Experiments conducted on two distinct datasets (lyric and pop-rock singers) show that the new set of features captures part of the singer's identity, although these features are less accurate than timbre-based ones. We propose to increase the recognition rate of singer identification by combining the information conveyed by the local and global descriptions of notes. The proposed method, which shows good results, can be adapted to classification problems involving a large number of classes, or to combining classifiers with different levels of performance.
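To make the global description concrete, here is a minimal sketch (our illustration, not the authors' code) of how vibrato rate and extent might be estimated from a note's f0 trajectory: the trajectory is converted to cents, detrended to remove portamento-like drift, and the dominant modulation in the 3 to 9 Hz band is read off its spectrum. The frame rate and function names are assumptions.

```python
import numpy as np

def vibrato_features(f0_hz, frame_rate=100.0):
    """Estimate vibrato rate (Hz) and extent (cents) from an f0 trajectory.

    f0_hz: 1-D array of f0 estimates over one sustained note (assumed voiced).
    frame_rate: analysis frames per second (assumption: 10 ms hop).
    """
    n = len(f0_hz)
    t = np.arange(n)
    f0_cents = 1200.0 * np.log2(f0_hz / np.mean(f0_hz))             # pitch in cents
    f0_detr = f0_cents - np.polyval(np.polyfit(t, f0_cents, 1), t)  # remove drift
    window = np.hanning(n)
    spectrum = np.abs(np.fft.rfft(f0_detr * window))
    freqs = np.fft.rfftfreq(n, d=1.0 / frame_rate)
    band = (freqs >= 3.0) & (freqs <= 9.0)                # typical vibrato range
    peak = np.argmax(np.where(band, spectrum, 0.0))
    rate_hz = freqs[peak]
    extent_cents = 2.0 * spectrum[peak] / window.sum()    # approx. peak amplitude
    return rate_hz, extent_cents
```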
Hyperbolic Embeddings for Order-Aware Classification of Audio Effect Chains

Audio effects (AFXs) are essential tools in music production, frequently applied in chains to shape timbre and dynamics. The order of AFXs in a chain plays a crucial role in determining the final sound, particularly when non-linear (e.g., distortion) or time-variant (e.g., chorus) processors are involved. Despite its importance, most AFX-related studies have primarily focused on estimating effect types and their parameters from a wet signal. To address this gap, we formulate AFX chain recognition as the task of jointly estimating AFX types and their order from a wet signal. We propose a neural-network-based method that embeds wet signals into a hyperbolic space and classifies their AFX chains. Hyperbolic space can represent tree-structured data more efficiently than Euclidean space due to its exponential expansion property. Since AFX chains can be represented as trees, with AFXs as nodes and edges encoding effect order, hyperbolic space is well suited for modeling the exponentially growing and non-commutative nature of ordered AFX combinations, where changes in effect order can result in different final sounds. Experiments using guitar sounds demonstrate that, with an appropriate curvature, the proposed method outperforms its Euclidean counterpart. Further analysis by AFX type and chain length highlights the effectiveness of the proposed method in capturing AFX order.
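For concreteness, a minimal sketch of the geodesic distance such a hyperbolic classifier can be built on, in the Poincaré-ball model with curvature -c (our illustration; the paper's exact model and names are not assumed):

```python
import torch

def poincare_distance(x, y, c=1.0, eps=1e-6):
    """Geodesic distance in the Poincare ball of curvature -c (sketch).

    x, y: (..., d) tensors inside the ball, i.e. c * ||.||^2 < 1.
    Closed form:
    d_c(x, y) = arcosh(1 + 2c||x-y||^2 / ((1 - c||x||^2)(1 - c||y||^2))) / sqrt(c)
    """
    diff2 = ((x - y) ** 2).sum(dim=-1)
    x2 = (x ** 2).sum(dim=-1)
    y2 = (y ** 2).sum(dim=-1)
    denom = ((1.0 - c * x2) * (1.0 - c * y2)).clamp_min(eps)
    arg = 1.0 + 2.0 * c * diff2 / denom
    return torch.acosh(arg.clamp_min(1.0 + eps)) / c ** 0.5
```

A classifier in this space can then score a signal embedding against learned chain embeddings by negative distance, with the curvature c treated as a hyperparameter, matching the abstract's observation that an appropriate curvature matters.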
Modelling Experts’ Decisions on Assigning Narrative Importances of Objects in a Radio Drama Mix

An increasing number of consumers of broadcast audio suffer from a degree of hearing impairment. One of the methods developed to tackle this issue consists of creating customizable object-based audio mixes in which users can attenuate parts of the mix using a simple complexity parameter. The method relies on the mixing engineer classifying audio objects in the mix according to their narrative importance. This paper focuses on automating this process. Individual tracks are first classified according to their music, speech, or sound-effect content. The decisions for assigning narrative importance to each segment of a radio drama mix are then modelled using mixture distributions. Finally, the learned decisions and the resulting mixes are evaluated using the Short-Time Objective Intelligibility (STOI) metric, with reference to the narrative-importance selections made by the original producer. This approach has applications in providing customizable mixes for legacy content, or for automatically generated media content where an engineer cannot intervene.
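One plausible way to model such decisions with mixture distributions is sketched below (entirely our illustration: the feature set, the use of scikit-learn's GaussianMixture, and the four importance levels are assumptions, not the paper's setup). One mixture is fitted per importance level on per-segment features, and new segments are assigned the level with the highest likelihood.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Placeholder per-segment features, e.g. [relative loudness (dB), content class];
# in practice these would come from the track-classification stage.
features = rng.normal(size=(200, 2))
labels = rng.integers(0, 4, size=200)   # producer's importance levels 0..3

# One mixture model per importance level, fitted on that level's segments.
models = {k: GaussianMixture(n_components=2, random_state=0)
             .fit(features[labels == k]) for k in range(4)}

def predict_importance(segment_features):
    """Assign each segment the importance level with the highest likelihood."""
    log_liks = np.stack([models[k].score_samples(segment_features)
                         for k in range(4)])
    return log_liks.argmax(axis=0)
```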
Improving Lyrics-to-Audio Alignment Using Frame-wise Phoneme Labels with Masked Cross Entropy Loss

This paper addresses the task of lyrics-to-audio alignment, which involves synchronizing textual lyrics with the corresponding music audio. Most publicly available datasets for this task provide annotations only at the line or word level, which poses a challenge for training lyrics-to-audio models due to the lack of frame-wise phoneme labels. However, we find that phoneme labels can be partially derived from word-level annotations: for single-phoneme words, all frames corresponding to the word can be labeled with the same phoneme; for multi-phoneme words, phoneme labels can be assigned at the first and last frames of the word. To leverage this partial information, we construct a mask for those frames and propose a masked frame-wise cross-entropy (CE) loss that considers only frames with known phoneme labels. As a baseline model, we adopt an autoencoder trained with a Connectionist Temporal Classification (CTC) loss and a reconstruction loss, and we enhance its training by incorporating the proposed masked frame-wise CE loss. Experimental results show that the added loss improves alignment performance. Compared with other state-of-the-art models, our model achieves a comparable Mean Absolute Error (MAE) of 0.216 seconds and the best Median Absolute Error (MedAE) of 0.041 seconds on the Jamendo test set.
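A minimal PyTorch sketch of such a masked frame-wise CE loss (the tensor layout and names are our assumptions):

```python
import torch
import torch.nn.functional as F

def masked_frame_ce(logits, targets, mask):
    """Frame-wise cross entropy over frames with known phoneme labels (sketch).

    logits:  (batch, frames, n_phonemes) network outputs.
    targets: (batch, frames) phoneme indices; arbitrary where mask is 0.
    mask:    (batch, frames) 1.0 for frames whose label is known (all frames
             of single-phoneme words; first/last frames of multi-phoneme
             words), 0.0 elsewhere.
    """
    # cross_entropy expects (batch, classes, frames)
    ce = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")
    return (ce * mask).sum() / mask.sum().clamp_min(1.0)
```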
Improving Synthesizer Programming From Variational Autoencoders Latent Space

Deep neural networks have recently been applied to the task of automatic synthesizer programming, i.e., finding optimal values of sound-synthesis parameters in order to reproduce a given input sound. This paper focuses on generative models, which can infer parameters as well as generate new sets of parameters or perform smooth morphing effects between sounds. We introduce new models that ensure scalability and increase performance by using heterogeneous representations of parameters as numerical and categorical random variables. Moreover, a spectral variational autoencoder architecture with multi-channel input is proposed in order to improve the inference of parameters related to the pitch and intensity of input sounds. Model performance was evaluated according to several criteria, such as parameter estimation error and audio reconstruction accuracy. Training and evaluation were performed on a 30k-preset dataset, which is published with this paper. The results demonstrate significant improvements in parameter inference and audio accuracy, and show that the presented models can be used with subsets or full sets of synthesizer parameters.
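As a sketch of what a loss over such heterogeneous parameter representations could look like (our illustration; shapes and names are assumptions), continuous parameters are regressed with MSE while each categorical parameter gets its own cross-entropy term:

```python
import torch
import torch.nn.functional as F

def preset_loss(pred_numerical, true_numerical, pred_cat_logits, true_cat):
    """Loss over heterogeneous synthesizer parameters (illustrative sketch).

    pred_numerical:  (batch, n_num) regressed continuous parameter values.
    pred_cat_logits: list of (batch, n_classes_i) logits, one per categorical
                     parameter (e.g. waveform type, filter mode).
    true_cat:        (batch, n_cat) ground-truth category indices.
    """
    loss = F.mse_loss(pred_numerical, true_numerical)    # numerical parameters
    for i, logits in enumerate(pred_cat_logits):         # categorical parameters
        loss = loss + F.cross_entropy(logits, true_cat[:, i])
    return loss
```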
Relative Music Loudness Estimation Using Temporal Convolutional Networks and a CNN Feature Extraction Front-End

Relative music loudness estimation is a MIR task that consists of dividing audio into segments of three classes: Foreground Music, Background Music and No Music. Given the temporal correlation of music, in this work we approach the task using a type of network with the ability to model temporal context: the Temporal Convolutional Network (TCN). We propose two architectures: a TCN, and a novel architecture resulting from the combination of a TCN with a Convolutional Neural Network (CNN) front-end, which we name CNN-TCN. We expect the CNN front-end to act as a feature extraction strategy that makes more efficient use of the network's parameters. We use the OpenBMAT dataset to train and test 40 TCN and 80 CNN-TCN models with two grid searches over a set of hyperparameters. We compare our models with the two best algorithms submitted to the music detection and relative music loudness estimation tasks at MIREX 2019. All our models outperform the MIREX algorithms, even with fewer parameters. The CNN-TCN emerges as the best architecture, as all of its models outperform all TCN models. We show that adding a CNN front-end to a TCN can reduce the number of parameters of the network while improving performance. The CNN front-end effectively works as a feature extractor, producing consistent patterns that identify different combinations of music and non-music sounds, and also yields a smoother output than the TCN models.
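A minimal sketch of the kind of architecture described (our illustration; the channel counts, mel-spectrogram input, and padding choices are assumptions, not the paper's configuration):

```python
import torch
import torch.nn as nn

class TCNBlock(nn.Module):
    """One dilated residual block of a TCN (sketch; sizes are assumptions)."""

    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        pad = (kernel_size - 1) * dilation // 2   # 'same' padding, non-causal
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=pad, dilation=dilation)
        self.relu = nn.ReLU()

    def forward(self, x):                         # x: (batch, channels, frames)
        return x + self.relu(self.conv(x))        # residual connection

class CNNTCN(nn.Module):
    """CNN front-end for feature extraction followed by a TCN stack (sketch)."""

    def __init__(self, n_mels=96, channels=32, n_blocks=4, n_classes=3):
        super().__init__()
        self.frontend = nn.Conv1d(n_mels, channels, kernel_size=3, padding=1)
        self.tcn = nn.Sequential(*[TCNBlock(channels, dilation=2 ** i)
                                   for i in range(n_blocks)])
        self.head = nn.Conv1d(channels, n_classes, kernel_size=1)

    def forward(self, mel):                       # mel: (batch, n_mels, frames)
        return self.head(self.tcn(self.frontend(mel)))  # per-frame class logits
```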
DDSP-Based Neural Waveform Synthesis of Polyphonic Guitar Performance From String-Wise MIDI Input

We explore the use of neural synthesis for acoustic guitar from string-wise MIDI input. We propose four different systems and compare them, with both objective metrics and subjective evaluation, against natural audio and a sample-based baseline. We develop the four systems iteratively, making various architectural considerations and introducing intermediate tasks such as predicting pitch and loudness control features. We find that formulating the control-feature prediction task as a classification task rather than a regression task yields better results. Furthermore, we find that our simplest proposed system, which directly predicts synthesis parameters from MIDI input, performs the best of the four. Audio examples and code are available.
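The classification formulation of control-feature prediction could, for instance, quantize a continuous feature such as loudness into bins and train with cross entropy (a sketch under our own assumptions about bin count and range; not the paper's exact recipe):

```python
import torch
import torch.nn.functional as F

N_BINS = 64                       # assumed number of quantization bins
LOUD_MIN, LOUD_MAX = -80.0, 0.0   # assumed loudness range in dB

def loudness_to_bin(loudness_db):
    """Quantize continuous loudness values (tensor) into class indices."""
    norm = (loudness_db - LOUD_MIN) / (LOUD_MAX - LOUD_MIN)
    return (norm.clamp(0.0, 1.0) * (N_BINS - 1)).round().long()

def control_feature_loss(logits, loudness_db):
    """Cross entropy on quantized targets: classification, not regression.

    logits: (batch, frames, N_BINS); loudness_db: (batch, frames).
    """
    targets = loudness_to_bin(loudness_db)
    return F.cross_entropy(logits.transpose(1, 2), targets)
```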
Using tensor factorisation models to separate drums from polyphonic music

This paper describes the use of Non-negative Tensor Factorisation models for the separation of drums from polyphonic audio. Improved separation of the drums is achieved through the incorporation of Gamma Chain priors into the Non-negative Tensor Factorisation framework. In contrast to many previous approaches, the method used in this paper requires little or no pre-training or use of drum templates. The utility of the technique is shown on real-world audio examples.
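For background, the multiplicative updates underlying this family of non-negative factorisations can be sketched as follows, shown here for plain matrix (rather than tensor) factorisation of a magnitude spectrogram and without the Gamma Chain priors the paper adds:

```python
import numpy as np

def nmf(V, rank=16, n_iter=200, eps=1e-9):
    """Multiplicative-update NMF: V ~ W @ H, with V (freq, time) non-negative."""
    rng = np.random.default_rng(0)
    W = rng.random((V.shape[0], rank)) + eps
    H = rng.random((rank, V.shape[1])) + eps
    for _ in range(n_iter):
        # Lee-Seung updates for the Euclidean (Frobenius) cost
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

In a tensor setting the same scheme factorises a multi-channel spectrogram, and a separated drum estimate can be resynthesised from the subset of components identified as percussive.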
Modelling of nonlinear state-space systems using a deep neural network

In this paper we present a new method for the pseudo black-box modelling of general continuous-time state-space systems using a discrete-time state-space system with an embedded deep neural network. Examples are given of how this method can be applied to a number of common nonlinear electronic circuits used in music technology, namely two kinds of diode-based guitar distortion circuits and the lowpass filter of the Korg MS-20 synthesizer.
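A minimal sketch of a discrete-time state-space system with an embedded network (our illustration; the state size, network shapes, and residual update are assumptions rather than the paper's architecture):

```python
import torch
import torch.nn as nn

class NeuralStateSpace(nn.Module):
    """Discrete-time state-space model with an embedded network (sketch).

    x[n+1] = x[n] + f_theta(x[n], u[n]),  y[n] = g_theta(x[n], u[n]),
    where f_theta and g_theta are learned from input/output recordings
    of the continuous-time circuit.
    """

    def __init__(self, n_states=2, n_hidden=32):
        super().__init__()
        self.n_states = n_states
        self.f = nn.Sequential(nn.Linear(n_states + 1, n_hidden), nn.Tanh(),
                               nn.Linear(n_hidden, n_states))
        self.g = nn.Linear(n_states + 1, 1)

    def forward(self, u):                 # u: (batch, samples, 1) input signal
        x = u.new_zeros(u.shape[0], self.n_states)
        y = []
        for n in range(u.shape[1]):       # sample-by-sample state recursion
            xu = torch.cat([x, u[:, n]], dim=-1)
            y.append(self.g(xu))
            x = x + self.f(xu)            # residual state update
        return torch.stack(y, dim=1)      # (batch, samples, 1) output signal
```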