Audio Effect Chain Estimation and Dry Signal Recovery From Multi-Effect-Processed Musical Signals
In this paper, we propose a method for a novel task: audio effect (AFX) chain estimation and dry signal recovery. AFXs are indispensable in modern sound design workflows. Sound engineers often cascade different AFXs (as an AFX chain) to achieve their desired soundscapes. Given a solo instrument performance to which multiple AFXs have been applied (the wet signal), our method automatically estimates the applied AFX chain and recovers the unprocessed dry signal, whereas previous research addresses only one of these two problems. The estimated chain helps novice engineers learn practical uses of AFXs, and the recovered signal can be reused with a different AFX chain. To solve this task, we first develop a deep neural network model that estimates the last-applied AFX and undoes it, one AFX at a time. We then apply the same model iteratively to estimate the full AFX chain and eventually recover the dry signal from the wet signal. Our experiments on guitar phrase recordings with various AFX chains demonstrate the validity of our method for both AFX chain estimation and dry signal recovery. We also confirm that the input wet signal can be reproduced by applying the estimated AFX chain to the recovered dry signal.
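The iterative procedure described in this abstract can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `step_model`, the `DRY` stop label, and `max_steps` are hypothetical names, and the abstract does not say how the iteration terminates. Here we assume the model signals that no AFX remains.

```python
from typing import Callable, List, Tuple, TypeVar

S = TypeVar("S")  # signal type (hypothetical; e.g. a waveform array)

DRY = "dry"  # hypothetical label meaning "no AFX left to undo"


def estimate_chain_and_recover(
    wet: S,
    step_model: Callable[[S], Tuple[str, S]],
    max_steps: int = 10,
) -> Tuple[List[str], S]:
    """Repeatedly estimate and undo the last-applied AFX.

    `step_model` stands in for the neural network: given a signal, it
    returns (estimated last-applied AFX, signal with that AFX undone).
    """
    signal = wet
    undone: List[str] = []  # AFXs in removal order (last-applied first)
    for _ in range(max_steps):  # assumed safety cap on chain length
        afx, less_processed = step_model(signal)
        if afx == DRY:  # assumed stopping rule
            break
        undone.append(afx)
        signal = less_processed
    # Reverse so the chain reads in application order (first-applied first).
    return undone[::-1], signal


# Toy stand-in for the model: a "signal" is (content, list of applied AFXs).
def toy_step_model(signal):
    content, effects = signal
    if not effects:
        return DRY, signal
    return effects[-1], (content, effects[:-1])


chain, dry = estimate_chain_and_recover(
    ("guitar", ["distortion", "chorus", "reverb"]), toy_step_model
)
print(chain)  # ['distortion', 'chorus', 'reverb']
print(dry)    # ('guitar', [])
```

Because each step removes only the last-applied AFX, the removal order is the reverse of the application order, which is why the list is reversed before it is returned.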
Improving Lyrics-to-Audio Alignment Using Frame-wise Phoneme Labels with Masked Cross Entropy Loss
This paper addresses the task of lyrics-to-audio alignment, which involves synchronizing textual lyrics with the corresponding music audio. Most publicly available datasets for this task provide annotations only at the line or word level. This poses a challenge for training lyrics-to-audio models because frame-wise phoneme labels are lacking. However, we find that phoneme labels can be partially derived from word-level annotations: for single-phoneme words, all frames corresponding to the word can be labeled with the same phoneme; for multi-phoneme words, phoneme labels can be assigned at the first and last frames of the word. To leverage this partial information, we construct a mask for those frames and propose a masked frame-wise cross-entropy (CE) loss that considers only frames with known phoneme labels. As a baseline model, we adopt an autoencoder trained with a Connectionist Temporal Classification (CTC) loss and a reconstruction loss. We then enhance the training process by incorporating the proposed masked frame-wise CE loss. Experimental results show that incorporating this loss improves alignment performance. Compared with other state-of-the-art models, our model achieves a comparable Mean Absolute Error (MAE) of 0.216 seconds and the best Median Absolute Error (MedAE) of 0.041 seconds on the Jamendo test dataset.
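The partial labeling rule and the masked loss described in this abstract can be illustrated with a small sketch. It is written in plain Python for clarity and rests on several assumptions that the abstract does not state: word boundaries are given as inclusive frame indices, the first frame of a multi-phoneme word receives its first phoneme and the last frame its last phoneme, unknown frames are marked with `None`, and the loss is averaged over the labeled frames. All function names are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple

# A word annotation: (start_frame, end_frame_inclusive, phoneme ids).
Word = Tuple[int, int, Sequence[int]]


def partial_frame_labels(words: Sequence[Word], n_frames: int) -> List[Optional[int]]:
    """Derive frame-wise phoneme labels where the word timing implies them."""
    labels: List[Optional[int]] = [None] * n_frames  # None = unknown (masked out)
    for start, end, phonemes in words:
        if not phonemes:
            continue
        if len(phonemes) == 1:
            # Single-phoneme word: every frame of the word has that phoneme.
            for t in range(start, end + 1):
                labels[t] = phonemes[0]
        else:
            # Multi-phoneme word: only the boundary frames are known (assumed
            # to carry the first and last phoneme, respectively).
            labels[start] = phonemes[0]
            labels[end] = phonemes[-1]
    return labels


def masked_cross_entropy(
    log_probs: Sequence[Sequence[float]], labels: Sequence[Optional[int]]
) -> float:
    """Mean negative log-likelihood over frames with a known label only."""
    total, count = 0.0, 0
    for frame_log_probs, label in zip(log_probs, labels):
        if label is None:  # masked frame: contributes nothing to the loss
            continue
        total -= frame_log_probs[label]
        count += 1
    return total / count if count else 0.0


# Toy example: 8 frames, 4 phoneme classes, uniform predictions.
words = [(0, 2, [3]), (3, 6, [1, 2])]
labels = partial_frame_labels(words, n_frames=8)
print(labels)  # [3, 3, 3, 1, None, None, 2, None]

uniform = [[math.log(0.25)] * 4 for _ in range(8)]
loss = masked_cross_entropy(uniform, labels)
```

In a deep learning framework, the same effect is usually obtained by multiplying the per-frame loss by a binary mask, or by mapping unknown frames to an ignored class index, before averaging.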