Improving Lyrics-to-Audio Alignment Using Frame-wise Phoneme Labels with Masked Cross Entropy Loss

This paper addresses the task of lyrics-to-audio alignment, which
involves synchronizing textual lyrics with corresponding music
audio. Most publicly available datasets for this task provide annotations only at the line or word level. This poses a challenge
for training lyrics-to-audio models due to the lack of frame-wise
phoneme labels. However, we find that phoneme labels can be
partially derived from word-level annotations: for single-phoneme
words, all frames corresponding to the word can be labeled with
the same phoneme; for multi-phoneme words, phoneme labels can
be assigned at the first and last frames of the word. To leverage
this partial information, we construct a mask for those frames and
propose a masked frame-wise cross-entropy (CE) loss that considers only frames with known phoneme labels. As a baseline model,
we adopt an autoencoder trained with a Connectionist Temporal
Classification (CTC) loss and a reconstruction loss. We then enhance the training process by incorporating the proposed frame-wise masked CE loss. Experimental results show that the frame-wise masked CE loss improves alignment performance. Compared with other state-of-the-art models, our model
provides a comparable Mean Absolute Error (MAE) of 0.216 seconds and a top Median Absolute Error (MedAE) of 0.041 seconds
on the testing Jamendo dataset.
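As an illustration, the masked loss described above reduces to a few lines: compute per-frame cross-entropy, then average only over frames whose phoneme label is known. The function name, array shapes, and mask convention below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def masked_frame_ce(log_probs, labels, mask):
    """Frame-wise cross-entropy restricted to frames with known labels.

    log_probs : (T, C) array of frame-wise log posteriors
    labels    : (T,) phoneme indices (values at masked-out frames are ignored)
    mask      : (T,) 1.0 where the phoneme label is known, 0.0 elsewhere
                (all frames of single-phoneme words; first and last
                frames of multi-phoneme words)
    """
    ce = -log_probs[np.arange(len(labels)), labels]  # per-frame CE
    return float((ce * mask).sum() / max(mask.sum(), 1.0))
```

In training, this term would simply be added to the CTC and reconstruction losses, weighted as appropriate.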
Perceptual Decorrelator Based on Resonators

Decorrelation filters transform mono audio into multiple decorrelated copies. This paper introduces a novel decorrelation filter design based on a resonator bank, which produces a sum of over a thousand exponentially decaying sinusoids. A headphone listening test was used to identify the minimum inter-channel time delays that perceptually match ERB-filtered coherent noise to corresponding incoherent noise. The decay rate of each resonator is set based on a group delay profile determined by the listening test results at its corresponding frequency. Furthermore, the delays from the test are used to refine frequency-dependent windowing in coherence estimation, which we argue represents the perceptually most accurate way of assessing interaural coherence. This coherence measure then guides an optimization process that adjusts the initial phases of the sinusoids to minimize the coherence between two instances of the resonator-based decorrelator. The delay results establish the necessary group delay per ERB for effective decorrelation, revealing higher-than-expected values, particularly at higher frequencies. For comparison, the optimization is also performed using two previously proposed group-delay profiles: one based on the period of the ERB band center frequency and another based on the maximum group-delay limit before introducing smearing. The results indicate that the perceptually informed profile achieves equal decorrelation to the latter profile while smearing less at high frequencies. Overall, optimizing the phase response of the proposed decorrelator yields significantly lower coherence compared to using a random phase.
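A minimal sketch of such a decaying-sinusoid resonator bank follows. Mapping the target group delay directly to each resonator's decay time constant is a simplifying assumption made here for illustration; the paper derives the decay profile from listening-test results, and two decorrelated channels would use the same bank with independently optimized phase sets.

```python
import numpy as np

def resonator_bank_ir(freqs, group_delays, phases, fs=48000, dur=0.05):
    """Impulse response of a bank of exponentially decaying sinusoids.

    freqs        : resonator centre frequencies in Hz
    group_delays : target group delay per resonator in seconds; used
                   directly as the decay time constant (an illustrative
                   assumption, not the paper's exact mapping)
    phases       : initial phase of each sinusoid (the quantity the
                   paper optimizes to minimize inter-channel coherence)
    """
    t = np.arange(int(fs * dur)) / fs
    h = np.zeros_like(t)
    for f, gd, ph in zip(freqs, group_delays, phases):
        h += np.exp(-t / gd) * np.sin(2.0 * np.pi * f * t + ph)
    return h / np.max(np.abs(h))  # normalize peak to 1
```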
Room Acoustic Modelling Using a Hybrid Ray-Tracing/Feedback Delay Network Method

Combining different room acoustic modelling methods could provide a better balance between perceptual plausibility and computational efficiency than using a single, potentially more computationally expensive model. In this work, a hybrid acoustic modelling system that integrates ray tracing (RT) with an advanced feedback delay network (FDN) is designed to generate perceptually plausible room impulse responses (RIRs). A multiple stimuli with hidden reference
and anchor (MUSHRA) test and a two-alternative-forced-choice
(2AFC) discrimination task have been conducted to compare the
proposed method against ground truth recordings and conventional
RT-based approaches. The results show that the proposed system
delivers robust performance in various scenarios, achieving highly
plausible reverberation synthesis.
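For context, the FDN side of such a hybrid can be sketched as a small textbook feedback delay network. The version below is a generic four-line FDN with a Hadamard feedback matrix and a single scalar gain, not the paper's advanced design, which would use frequency-dependent attenuation and parameters matched to the ray-tracing output.

```python
import numpy as np

def fdn_render(x, delays, g=0.97):
    """Minimal four-line feedback delay network (illustrative sketch).

    x      : input signal
    delays : four mutually incommensurate delay lengths in samples
    g      : scalar feedback gain controlling decay (a real system
             would use frequency-dependent attenuation filters)
    """
    N = len(delays)  # expected to be 4 for this Hadamard matrix
    A = np.array([[1, 1, 1, 1],
                  [1, -1, 1, -1],
                  [1, 1, -1, -1],
                  [1, -1, -1, 1]]) / 2.0  # orthogonal (lossless) mixing
    bufs = [np.zeros(d) for d in delays]
    idx = [0] * N
    y = np.zeros(len(x))
    for n in range(len(x)):
        outs = np.array([bufs[i][idx[i]] for i in range(N)])
        y[n] = outs.sum()
        back = g * (A @ outs)  # mixed, attenuated feedback
        for i in range(N):
            bufs[i][idx[i]] = x[n] + back[i]
            idx[i] = (idx[i] + 1) % delays[i]
    return y
```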
Zero-Phase Sound via Giant FFT

Given the speedy computation of the FFT in current computer
hardware, there are new possibilities for examining transformations for very long sounds. A zero-phase version of any audio
signal can be obtained by zeroing the phase angle of its complex
spectrum and taking the inverse FFT. This paper recommends additional processing steps, including zero-padding, transient suppression at the signal’s start and end, and gain compensation, to
enhance the resulting sound quality. As a result, a sound with the
same spectral characteristics as the original one, but with different temporal events, is obtained. Repeating rhythm patterns are
retained, however. Zero-phase sounds are palindromic in the sense
that they are symmetric in time. A comparison of the zero-phase
conversion to the autocorrelation function helps to understand its
properties, such as why the rhythm of the original sound is emphasized. It is also argued that the zero-phase signal has the same
autocorrelation function as the original sound. One exciting variation is to apply the method separately to the real
and imaginary parts of the spectrum to produce a stereo effect. A
frame-based technique enables the use of the zero-phase conversion in real-time audio processing. The zero-phase conversion is
another member of the giant FFT toolset, allowing the modification of sampled sounds, such as drum loops or entire songs.
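The core conversion amounts to a single magnitude-only inverse FFT, as sketched below with optional zero-padding. The transient suppression and gain compensation steps recommended above are omitted here, and the function name is illustrative.

```python
import numpy as np

def zero_phase(x, n=None):
    """Zero-phase version of a signal via one 'giant' FFT.

    n : optional FFT length for zero-padding (one of the recommended
        extra processing steps); defaults to len(x).
    """
    n = len(x) if n is None else n
    mag = np.abs(np.fft.rfft(x, n))  # keep magnitude, discard phase
    y = np.fft.irfft(mag, n)         # real, time-symmetric result
    return y                         # y[k] == y[n - k]: palindromic
```

Because the spectrum is real and non-negative, the output is symmetric in time (palindromic) while retaining the original magnitude spectrum.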
Evaluating the Performance of Objective Audio Quality Metrics in Response to Common Audio Degradations

This study evaluates the performance of five objective audio quality metrics—PEAQ Basic, PEAQ Advanced, PEMO-Q, ViSQOL, and HAAQI—in the context of digital music production. Unlike
previous comparisons, we focus on their suitability for production environments, an area currently underexplored in existing research. Twelve audio examples were tested using two evaluation
types: an effectiveness test under progressively increasing degradations (hum, hiss, clipping, glitches) and a robustness test under
fixed-level, randomly fluctuating degradations.
In the effectiveness test, HAAQI, PEMO-Q, and PEAQ Basic
effectively tracked degradation changes, while PEAQ Advanced
failed consistently and ViSQOL showed low sensitivity to hum
and glitches. In the robustness test, ViSQOL and HAAQI demonstrated the highest consistency, with average standard deviations
of 0.004 and 0.007, respectively, followed by PEMO-Q (0.021),
PEAQ Basic (0.057), and PEAQ Advanced (0.065).
However, ViSQOL also showed low variability across audio examples, suggesting limited genre sensitivity.
These findings highlight the strengths and limitations of each
metric for music production, specifically quality measurement with
compressed audio. The source code and dataset will be made publicly available upon publication.
Comparing Acoustic and Digital Piano Actions: Data Analysis and Key Insights

The acoustic piano and its sound production mechanisms have been
extensively studied in the field of acoustics. Similarly, digital piano synthesis has been the focus of numerous signal processing
research studies. However, the role of the piano action in shaping the dynamics and nuances of piano sound has received less
attention, particularly in the context of digital pianos. Digital pianos are well-established commercial instruments that typically use
weighted keys with two or three sensors to measure the average
key velocity—this being the only input to a sampling synthesis
engine. In this study, we investigate whether this simplified measurement method adequately captures the full dynamic behavior of
the original piano action. After a brief review of the state of the art,
we describe an experimental setup designed to measure physical
properties of the keys and hammers of a piano. This setup enables
high-precision readings of acceleration, velocity, and position for
both the key and hammer across various dynamic levels. Through
extensive data analysis, we examine their relationships and identify
the optimal key position for velocity measurement. We also analyze
a digital piano key to determine where the average key velocity is
measured and compare it with our proposed optimal timing. We
find that the instantaneous key velocity just before let-off correlates
most strongly with hammer impact velocity, indicating a target
for improved sensing; however, due to the limitations of discrete
velocity sensing, this optimization alone may not suffice to replicate
the nuanced expressiveness of acoustic piano touch. This study
represents the first step in a broader research effort aimed at linking
piano touch, dynamics, and sound production.
Digital Morphophone Environment. Computer Rendering of a Pioneering Sound Processing Device

This paper introduces a digital reconstruction of the morphophone,
a complex magnetophonic device developed in the 1950s within
the laboratories of the GRM (Groupe de Recherches Musicales)
in Paris. The analysis, design, and implementation methodologies
underlying the Digital Morphophone Environment are discussed.
Based on a detailed review of historical sources and limited
documentation – including a small body of literature and, most
notably, archival images – the core operational principles of the
morphophone have been modeled within the MAX visual programming environment. The main goals of this work are, on the one
hand, to study and make accessible a now obsolete and unavailable
tool, and on the other, to provide the opportunity for new explorations in computer music and research.
Towards an Objective Comparison of Panning Feature Algorithms for Unsupervised Learning

Estimates of panning attributes are important features to extract from a piece of recorded music, with downstream uses such
as classification, quality assessment, and listening enhancement.
While several algorithms exist in the literature, there is currently
no comparison between them and no studies to suggest which one
is most suitable for any particular task. This paper compares four
algorithms for extracting amplitude panning features with respect
to their suitability for unsupervised learning. It finds synchronicities between them and analyses their results on a small set of
commercial music excerpts chosen for their distinct panning features. The ability of each algorithm to differentiate between the
tracks is analysed. The results can be used in future work to either
select the most appropriate panning feature algorithm or create a
version customized for a particular task.
Impedance Synthesis for Hybrid Analog-Digital Audio Effects

Most real systems, from acoustics to analog electronics, are
characterised by bidirectional coupling amongst elements rather
than neat, unidirectional signal flows between self-contained modules. Integrating digital processing into physical domains becomes
a significant engineering challenge when the application requires
bidirectional coupling across the physical-digital boundary rather
than separate, well-defined inputs and outputs. We introduce an
approach to hybrid analog-digital audio processing using synthetic
impedance: digitally simulated circuit elements integrated into an
otherwise analog circuit. This approach combines the physicality and classic character of analog audio circuits with the
precision and flexibility of digital signal processing (DSP). Our
impedance synthesis system consists of a voltage-controlled current source and a microcontroller-based DSP system. We demonstrate our technique through modifying an iconic guitar distortion pedal, the Boss DS-1, showing the ability of the synthetic
impedance to both replicate and extend the behaviour of the pedal’s
diode clipping stage. We discuss the behaviour of the synthetic
impedance in isolated laboratory conditions and in the DS-1 pedal,
highlighting the technical and creative potential of the technique as
well as its practical limitations and future extensions.
Estimation of Multi-Slope Amplitudes in Late Reverberation

The common-slope model is used to model late reverberation of
complex room geometries such as multiple coupled rooms. The
model fits band-limited room impulse responses using a set of
common decay rates, with amplitudes varying based on listener
positions. This paper investigates amplitude estimation methods
within the common-slope model framework. We compare several traditional least squares estimation methods and propose using
LINEX regression, a maximum-likelihood approach using log-squared RIR statistics. Through statistical analysis and simulation
tests, we demonstrate that LINEX regression improves accuracy
and reduces bias when compared to traditional methods.
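For reference, once the common decay rates are fixed, amplitude estimation reduces to a linear problem. The sketch below uses ordinary least squares on the energy decay curve as a simple baseline, not the LINEX regression proposed in the paper; function and variable names are illustrative.

```python
import numpy as np

def fit_slope_amplitudes(times, edc, decay_times):
    """Ordinary least-squares amplitudes for a common-slope decay model.

    The EDC is modelled as sum_i A_i * exp(-t / T_i) plus a constant
    noise-floor term, where the decay times T_i are the common slopes
    shared across listener positions.
    """
    # Design matrix: one exponential column per common decay time,
    # plus a constant column for the noise floor.
    X = np.column_stack(
        [np.exp(-times / T) for T in decay_times] + [np.ones_like(times)]
    )
    amps, *_ = np.linalg.lstsq(X, edc, rcond=None)
    return amps  # one amplitude per slope; last entry is the noise floor
```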