Transition-Aware: A More Robust Approach for Piano Transcription
Piano transcription is a classic problem in music information retrieval, and a growing number of transcription methods based on deep learning have been proposed in recent years. In 2019, Google Brain published a larger piano transcription dataset, MAESTRO. On this dataset, the Onsets and Frames transcription approach proposed by Hawthorne achieved a stunning onset F1 score of 94.73%. Unlike the annotation method of Onsets and Frames, the Transition-aware model presented in this paper annotates the attack process of piano signals, called the attack transition, over multiple frames instead of marking only the onset frame. In this way, the piano signal around the onset time is taken into account, making onset detection more stable and robust. Transition-aware achieves a higher transcription F1 score than Onsets and Frames on both the MAESTRO and MAPS datasets, reducing many extra note detection errors. This indicates that the Transition-aware approach has better generalization ability across datasets.
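
A minimal numeric sketch of the difference between the two labeling schemes is shown below; the window length and the decaying label shape are assumptions made for illustration, not the paper's exact annotation scheme.

```python
# Sketch only: expand single-frame onset labels into multi-frame
# "attack transition" targets. Window width and weight shape are assumed.
import numpy as np

def onset_targets(onset_times, n_frames, fps):
    """Classic labeling: 1 only at the frame containing the onset."""
    y = np.zeros(n_frames)
    for t in onset_times:
        f = int(round(t * fps))
        if 0 <= f < n_frames:
            y[f] = 1.0
    return y

def transition_targets(onset_times, n_frames, fps, width=3):
    """Transition-style labeling: mark several frames around each onset,
    here with linearly decaying weights (an assumed shape)."""
    y = np.zeros(n_frames)
    for t in onset_times:
        f = int(round(t * fps))
        for k in range(-width, width + 1):
            i = f + k
            if 0 <= i < n_frames:
                y[i] = max(y[i], 1.0 - abs(k) / (width + 1))
    return y

if __name__ == "__main__":
    onsets = [0.10, 0.50]  # onset times in seconds
    print(onset_targets(onsets, n_frames=40, fps=31.25))
    print(transition_targets(onsets, n_frames=40, fps=31.25))
```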

An Audio-Visual Fusion Piano Transcription Approach Based on Strategy
Piano transcription is a fundamental problem in the field of music information retrieval. At present, a large number of transcription studies are based mainly on audio or video, while audio-visual fusion has received comparatively little discussion. In this paper, a piano transcription model based on strategy fusion is proposed, in which the transcription results of the video model are used to assist audio transcription. Because of the lack of datasets currently suitable for audio-visual fusion, the OMAPS dataset is proposed in this paper. Our strategy fusion model achieves a 92.07% F1 score on the OMAPS dataset. The transcription model based on feature fusion is also compared with the one based on strategy fusion, and the experimental results show that strategy fusion achieves better results than feature fusion.
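
One way a strategy-fusion rule can be pictured is sketched below; the tolerance, confidence threshold, and decision logic are assumptions for the sketch, not the strategy used in the paper.

```python
# Hedged sketch: use video-detected key presses to confirm or rescue
# audio note detections. All thresholds below are illustrative.
def fuse(audio_notes, video_notes, tol=0.05, low_conf=0.5):
    """audio_notes: list of (onset_sec, midi_pitch, confidence);
       video_notes: list of (onset_sec, midi_pitch) from the video model."""
    fused = []
    for onset, pitch, conf in audio_notes:
        confirmed = any(
            p == pitch and abs(t - onset) <= tol for t, p in video_notes
        )
        # Keep confident audio detections; keep weak ones only if video agrees.
        if conf >= low_conf or confirmed:
            fused.append((onset, pitch))
    return fused

print(fuse([(0.10, 60, 0.9), (0.52, 64, 0.3)], [(0.51, 64)]))
```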

Unsupervised Text-to-Sound Mapping via Embedding Space Alignment
This work focuses on developing an artistic tool that performs an unsupervised mapping between text and sound, converting an input text string into a series of sounds from a given sound corpus. Using a pre-trained sound embedding model and a separate, pre-trained text embedding model, the goal is to find a mapping between the two feature spaces. Our approach is unsupervised, which allows any sound corpus to be used with the system. The tool performs text-to-sound retrieval, creating a sound file in which each word in the input text is mapped to a single sound in the corpus, and the resulting sounds are concatenated to play sequentially. We experiment with three different mapping methods and perform quantitative and qualitative evaluations of the outputs. Our results demonstrate the potential of unsupervised methods for creative applications in text-to-sound mapping.
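
A minimal sketch of the retrieval step is given below. It assumes that both embedding spaces are first reduced to a common dimensionality (here with a plain PCA, just one of many possible unsupervised alignments and not necessarily one of the three mapping methods studied) and that each word is matched to the cosine-nearest sound in the corpus.

```python
# Sketch only: align two embedding spaces by projecting each to a common
# dimension, then map every word to its nearest sound. Dimensions are made up.
import numpy as np

def pca_project(X, dim):
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:dim].T

def text_to_sound_indices(word_embs, sound_embs, dim=8):
    W = pca_project(word_embs, dim)
    S = pca_project(sound_embs, dim)
    W /= np.linalg.norm(W, axis=1, keepdims=True) + 1e-9
    S /= np.linalg.norm(S, axis=1, keepdims=True) + 1e-9
    # One corpus index per word; concatenating those sounds yields the output.
    return np.argmax(W @ S.T, axis=1)

words = np.random.randn(20, 300)    # stand-in for pre-trained text embeddings
sounds = np.random.randn(200, 128)  # stand-in for pre-trained sound embeddings
print(text_to_sound_indices(words, sounds))
```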

Identification of Nonlinear Circuits as Port-Hamiltonian Systems
This paper addresses the identification of nonlinear circuits for power-balanced virtual analog modeling and simulation. The proposed method combines a port-Hamiltonian system formulation with kernel-based methods to retrieve model laws from measurements. This combination allows the estimated model to retain physical properties that are crucial for the accuracy of simulations, while representing a variety of nonlinear behaviors. As an illustration, the method is used to identify a nonlinear passive peaking EQ.
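
For readers unfamiliar with the formulation, the sketch below simulates a generic port-Hamiltonian system dx/dt = (J - R) grad H(x) + G u for a series RLC circuit. In the paper, the Hamiltonian gradient and dissipation law are estimated from measurements with kernel-based methods; here they are written by hand purely for illustration.

```python
# Illustration only (not the identification method): simulate a
# port-Hamiltonian system dx/dt = (J - R) @ grad_H(x) + G * u(t)
# for a series RLC circuit driven by a voltage source.
import numpy as np

L_, C_, R_ = 1e-3, 1e-6, 10.0              # inductance, capacitance, resistance
J = np.array([[0.0, 1.0], [-1.0, 0.0]])    # lossless interconnection
R = np.array([[0.0, 0.0], [0.0, R_]])      # dissipation
G = np.array([0.0, 1.0])                   # input port (source voltage)

def grad_H(x):
    q, phi = x                             # state: charge and flux
    return np.array([q / C_, phi / L_])    # a nonlinear H(x) could be used here

def simulate(u, dt=1e-7, steps=2000):
    x = np.zeros(2)
    for n in range(steps):                 # forward Euler, for illustration only
        x = x + dt * ((J - R) @ grad_H(x) + G * u(n * dt))
    return x

print(simulate(lambda t: 1.0))             # state after a 1 V step input
```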

Expressive Piano Performance Rendering from Unpaired Data
Recent advances in data-driven expressive performance rendering have enabled automatic models to reproduce the characteristics and variability of human performances of musical compositions. However, these models need to be trained on aligned pairs of scores and performances, and they rely notably on score-specific markings, which limits their scope of application. This work tackles the piano performance rendering task in a low-informed setting, considering only the score note information and using no aligned data. The proposed model relies on adversarial training in which the basic score note properties are modified in order to reproduce the expressive qualities contained in a dataset of real performances. First results for unaligned score-to-performance rendering are presented through a listening test. While the interpretation quality is not on par with highly supervised methods and human renditions, our method shows promising results for transferring realistic expressivity into scores.
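
The unpaired training idea can be pictured with the schematic sketch below. The note features, network sizes, and deviation parameterization are assumptions made for the sketch and do not reflect the paper's actual architecture.

```python
# Schematic sketch of unpaired adversarial rendering: a generator perturbs
# basic note properties and a discriminator judges whether the result looks
# like a real performance. Feature layout and networks are assumed.
import torch
import torch.nn as nn

N_FEAT = 4  # assumed per-note features: pitch, onset, duration, velocity

gen = nn.Sequential(nn.Linear(N_FEAT, 64), nn.ReLU(), nn.Linear(64, 3))
disc = nn.Sequential(nn.Linear(N_FEAT, 64), nn.ReLU(), nn.Linear(64, 1))
opt_g = torch.optim.Adam(gen.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def render(score):
    """Apply predicted timing/velocity deviations to the raw score notes."""
    dev = gen(score)                       # onset shift, duration scale, velocity
    onset = score[:, 1] + 0.05 * torch.tanh(dev[:, 0])
    dur = score[:, 2] * torch.exp(0.2 * torch.tanh(dev[:, 1]))
    vel = torch.sigmoid(dev[:, 2])
    return torch.stack([score[:, 0], onset, dur, vel], dim=1)

score = torch.rand(32, N_FEAT)             # unpaired score notes (toy data)
real = torch.rand(32, N_FEAT)              # unpaired performance notes (toy data)

fake = render(score)                       # one adversarial update as an example
d_loss = bce(disc(real), torch.ones(32, 1)) + bce(disc(fake.detach()), torch.zeros(32, 1))
opt_d.zero_grad()
d_loss.backward()
opt_d.step()

g_loss = bce(disc(fake), torch.ones(32, 1))
opt_g.zero_grad()
g_loss.backward()
opt_g.step()
```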

A Structural Similarity Index Based Method to Detect Symbolic Monophonic Patterns in Real-Time
Automatic detection of musical patterns is an important task in the field of Music Information Retrieval due to its use in multiple applications such as automatic music transcription, genre or instrument identification, music classification, and music recommendation. A significant sub-task is real-time pattern detection in music, owing to its relevance in application domains such as the Internet of Musical Things. In this study, we present a method to identify occurrences of known patterns in symbolic monophonic music streams in real time. We introduce a matrix-based representation that denotes musical notes by their pitch, pitch bend, amplitude, and duration. We propose an algorithm based on an independent similarity index for each note attribute. We also introduce the Match Measure, a numerical value signifying the degree of match between a pattern and a sequence of notes. We have tested the proposed algorithm against three datasets: a human-recorded dataset, a synthetically designed dataset, and the JKUPDD dataset. Overall, a detection rate of 95% was achieved. The low computational load and minimal running time demonstrate the suitability of the method for real-world, real-time implementations on embedded systems.
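
A toy version of the per-attribute idea is sketched below; the similarity functions, scales, and weights are assumptions and differ from the structural-similarity-based indices and Match Measure defined in the paper.

```python
# Sketch only: notes as matrix rows [pitch, pitch_bend, amplitude, duration],
# one similarity score per attribute, combined into a toy match measure.
import numpy as np

def attribute_similarity(a, b, scale):
    """Per-attribute similarity in [0, 1], decaying with absolute difference."""
    return float(np.mean(np.exp(-np.abs(a - b) / scale)))

def match_measure(pattern, window, weights=(0.5, 0.1, 0.2, 0.2),
                  scales=(1.0, 0.5, 10.0, 0.1)):
    sims = [attribute_similarity(pattern[:, k], window[:, k], scales[k])
            for k in range(pattern.shape[1])]
    return float(np.dot(weights, sims))

# columns: pitch (MIDI), pitch bend (semitones), amplitude (0-127), duration (s)
pattern = np.array([[60, 0.0, 90, 0.25],
                    [62, 0.0, 85, 0.25],
                    [64, 0.0, 95, 0.50]])
stream = pattern + np.array([0, 0.0, 5, 0.02])  # slightly varied occurrence
print(match_measure(pattern, stream))
```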

Partiels – Exploring, Analyzing and Understanding Sounds
This article presents Partiels, an open-source application developed at IRCAM to analyze digital audio files and explore sound characteristics.
The application uses Vamp plug-ins to extract various kinds of information about different aspects of the sound, such
as spectrum, partials, pitch, tempo, text, and chords. Partiels is the
successor to AudioSculpt, offering a modern, flexible interface for
visualizing, editing, and exporting analysis results, addressing a
wide range of issues from musicological practice to sound creation
and signal processing research. The article describes Partiels’ key
features, including analysis organization, audio file management,
results visualization and editing, as well as data export and sharing
options, and its interoperability with other software such as Max
and Pure Data. In addition, it highlights the numerous analysis
plug-ins developed at IRCAM, based in particular on machine
learning models, as well as the IRCAM Vamp extension, which
overcomes certain limitations of the original Vamp format.

Discrete Implementation of the First Order System Cascade as the Basis for a Melodic Segmentation Model
The basis for a low-level melodic segmentation model and its discrete implementation are presented. The model is based on a discrete approximation of the one-dimensional convective transport mechanism. In this way, a physically plausible mechanism for achieving a multi-scale representation is obtained. Some aspects of edge detection theory thought to be relevant to solving similar problems in auditory perception are briefly introduced. Two examples presenting the dynamic behaviour of the model are shown.
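
A minimal sketch of such a cascade is shown below, with an assumed pole value and stage count: repeated discrete first-order stages yield progressively coarser versions of a pitch contour, which is the kind of multi-scale representation the segmentation model builds on.

```python
# Illustration only: a cascade of identical discrete first-order (one-pole)
# low-pass stages applied to a pitch contour. The pole value is assumed.
import numpy as np

def first_order_stage(x, a=0.7):
    """y[n] = a*y[n-1] + (1-a)*x[n], a single discrete first-order system."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        y[n] = a * y[n - 1] + (1 - a) * x[n] if n else (1 - a) * x[n]
    return y

def cascade(x, n_stages=4, a=0.7):
    """Return every stage output: scale 1 (fine) to n_stages (coarse)."""
    outputs, signal = [], np.asarray(x, dtype=float)
    for _ in range(n_stages):
        signal = first_order_stage(signal, a)
        outputs.append(signal)
    return outputs

melody = [60, 62, 64, 65, 67, 60, 60, 72, 71, 69]  # toy pitch contour (MIDI)
for k, y in enumerate(cascade(melody), 1):
    print(f"scale {k}:", np.round(y, 1))
```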

Polyphonic music analysis by signal processing and support vector machines
In this paper, an original system for the analysis of harmony and polyphonic music is introduced. The system is based on signal processing and machine learning. A new fast, multi-resolution analysis method is devised to extract the time-frequency energy spectrum at the signal processing stage, while a support vector machine is used as the machine learning component. Aiming at the analysis of rather general audio content, experiments are conducted on a large set of recorded samples, using 19 musical instruments, alone or in combination, with varying degrees of polyphony. Experimental results show that fundamental frequencies are detected with a remarkable success rate and that the method can provide excellent results in general cases.
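
The overall pipeline shape can be sketched as below. The synthetic tones, the two FFT window sizes, and the scikit-learn classifier settings are stand-ins chosen for the sketch, not the analysis front-end or training data used in the paper.

```python
# Sketch only: spectral features at two resolutions feed a support vector
# machine that predicts whether a given fundamental is present.
import numpy as np
from sklearn.svm import SVC

SR = 16000

def features(x):
    # crude "multi-resolution" feature: magnitude spectra at two window sizes
    f1 = np.abs(np.fft.rfft(x[:1024] * np.hanning(1024)))
    f2 = np.abs(np.fft.rfft(x[:4096] * np.hanning(4096)))
    return np.concatenate([f1, f2])

def tone(f0, partials=5):
    t = np.arange(4096) / SR
    return sum(np.sin(2 * np.pi * f0 * (k + 1) * t) / (k + 1) for k in range(partials))

rng = np.random.default_rng(0)
X, y = [], []
for i in range(60):
    target = i % 2                       # balanced toy labels: "is A4 present?"
    f0 = rng.uniform(430, 450) if target else rng.uniform(100, 400)
    X.append(features(tone(f0) + 0.1 * rng.standard_normal(4096)))
    y.append(target)

clf = SVC(kernel="rbf", C=10.0).fit(X, y)
print(clf.predict([features(tone(440)), features(tone(200))]))
```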

Vocal Tract Area Estimation by Gradient Descent
Articulatory features can provide interpretable and flexible controls for the synthesis of human vocalizations by allowing the user to directly modify parameters like vocal strain or lip position. To make this manipulation through resynthesis possible, we need to estimate the features that result in a desired vocalization directly from audio recordings. In this work, we propose a white-box optimization technique for estimating glottal source parameters and vocal tract shapes from audio recordings of human vowels. The approach is based on inverse filtering and optimizing the frequency response of a waveguide model of the vocal tract with gradient descent, propagating error gradients through the mapping of articulatory features to the vocal tract area function. We apply this method to the task of matching the sound of the Pink Trombone, an interactive articulatory synthesizer, to a given vocalization. We find that our method accurately recovers control functions for audio generated by the Pink Trombone itself. We then compare our technique against evolutionary optimization algorithms and a neural network trained to predict control parameters from audio. A subjective evaluation finds that our approach outperforms these black-box optimization baselines on the task of reproducing human vocalizations.
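
The general white-box idea, fitting the parameters of a differentiable frequency-response model by gradient descent, can be sketched with a much simpler stand-in model (two resonances rather than a waveguide vocal tract and its area function); all values below are made up for illustration.

```python
# Toy illustration of white-box frequency-response fitting by gradient descent.
# A cascade of two resonances stands in for the vocal tract model.
import torch

freqs = torch.linspace(50.0, 4000.0, 256)

def resonator_mag(f0, bw):
    """Magnitude response of a single resonance evaluated on `freqs`."""
    return 1.0 / torch.sqrt((freqs**2 - f0**2) ** 2 + (bw * freqs) ** 2)

def model_mag(params):
    f1, b1, f2, b2 = params
    return resonator_mag(f1, b1) * resonator_mag(f2, b2)

# made-up "target vowel" response generated from known parameters
target = model_mag(torch.tensor([700.0, 130.0, 1200.0, 160.0]))

params = torch.tensor([500.0, 100.0, 1500.0, 100.0], requires_grad=True)
opt = torch.optim.Adam([params], lr=5.0)
for step in range(2000):
    loss = torch.mean((torch.log(model_mag(params)) - torch.log(target)) ** 2)
    opt.zero_grad()
    loss.backward()
    opt.step()

print(params.detach())  # should move toward the target resonance parameters
```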