Download GPGPU Audio Benchmark Framework
Acceleration of audio workloads on general-purpose GPU (GPGPU) hardware offers potentially high speedup factors, but also presents challenges in development and deployment. We can increasingly depend on such hardware being available in users’ systems, yet few real-time audio products use this resource. We propose a suite of benchmarks to qualify a GPU as suitable for batch or real-time audio processing, comprising both microbenchmarks and higher-level audio-domain benchmarks. We choose metrics based on application, paying particularly close attention to the latency tail distribution. We propose an extension to the benchmark framework to more accurately simulate the real-world request pattern and performance requirements of running in a digital audio workstation. We run these benchmarks on two common consumer-level platforms: a PC desktop with a recent midrange discrete GPU and a Macintosh desktop with a unified CPU-GPU memory architecture.
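As an illustrative aside (not from the paper), the emphasis on the latency tail distribution can be made concrete with a small metric helper; the particular percentiles chosen here are a common convention, not the paper's:

```python
import numpy as np

def latency_tail_stats(latencies_us):
    """Summarise a latency distribution with tail percentiles:
    for real-time audio it is the worst-case tail, not the mean,
    that determines whether a deadline (and hence a buffer) is missed."""
    a = np.asarray(latencies_us, dtype=float)
    return {
        "mean": float(a.mean()),
        "p50": float(np.percentile(a, 50)),
        "p99": float(np.percentile(a, 99)),
        "p99.9": float(np.percentile(a, 99.9)),
        "max": float(a.max()),
    }
```

A distribution with 99 fast runs and one slow outlier has a benign mean and median but a max that would cause an audible dropout, which is exactly why tail metrics matter here.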
Download Comparing Acoustic and Digital Piano Actions: Data Analysis and Key Insights
The acoustic piano and its sound production mechanisms have been extensively studied in the field of acoustics. Similarly, digital piano synthesis has been the focus of numerous signal processing research studies. However, the role of the piano action in shaping the dynamics and nuances of piano sound has received less attention, particularly in the context of digital pianos. Digital pianos are well-established commercial instruments that typically use weighted keys with two or three sensors to measure the average key velocity—this being the only input to a sampling synthesis engine. In this study, we investigate whether this simplified measurement method adequately captures the full dynamic behavior of the original piano action. After a brief review of the state of the art, we describe an experimental setup designed to measure physical properties of the keys and hammers of a piano. This setup enables high-precision readings of acceleration, velocity, and position for both the key and hammer across various dynamic levels. Through extensive data analysis, we examine their relationships and identify the optimal key position for velocity measurement. We also analyze a digital piano key to determine where the average key velocity is measured and compare it with our proposed optimal timing. We find that the instantaneous key velocity just before let-off correlates most strongly with hammer impact velocity, indicating a target for improved sensing; however, due to the limitations of discrete velocity sensing, this optimization alone may not suffice to replicate the nuanced expressiveness of acoustic piano touch. This study represents the first step in a broader research effort aimed at linking piano touch, dynamics, and sound production.
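As a concrete illustration (not from the paper) of the two-sensor measurement scheme the abstract describes: with two trigger points a fixed distance apart, the only quantity available is the average velocity over that interval, which discards the instantaneous velocity profile of the key.

```python
def average_key_velocity(sensor_gap_mm, t_first_s, t_second_s):
    """Average key velocity (mm/s) as a two-sensor digital piano
    action would estimate it: the distance between the two trigger
    points divided by the time between their activations.
    Note this is a single scalar; the velocity *profile* of the
    keystroke between the sensors is unobservable."""
    dt = t_second_s - t_first_s
    if dt <= 0:
        raise ValueError("second sensor must trigger after the first")
    return sensor_gap_mm / dt
```

Two keystrokes with very different velocity profiles (e.g. accelerating vs. decelerating into let-off) can produce the same average velocity, which is one way to see why a single averaged reading may not capture the nuance the paper investigates.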
Download Digital Morphophone Environment. Computer Rendering of a Pioneering Sound Processing Device
This paper introduces a digital reconstruction of the morphophone, a complex magnetophonic device developed in the 1950s within the laboratories of the GRM (Groupe de Recherches Musicales) in Paris. The analysis, design, and implementation methodologies underlying the Digital Morphophone Environment are discussed. Based on a detailed review of historical sources and limited documentation – including a small body of literature and, most notably, archival images – the core operational principles of the morphophone have been modeled within the MAX visual programming environment. The main goals of this work are, on the one hand, to study and make accessible a now obsolete and unavailable tool, and on the other, to provide the opportunity for new explorations in computer music and research.
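To a first approximation (an assumption for illustration, not the paper's model), a tape loop read by several playback heads at fixed positions behaves like a multi-tap delay line; a minimal sketch, with tap positions and gains chosen arbitrarily:

```python
import numpy as np

def multitap_delay(x, sr, taps):
    """Sum of delayed, scaled copies of the input, as if several
    playback heads read the same tape loop at different positions.
    taps: list of (delay_seconds, gain) pairs.
    Output has the same length as the input."""
    y = np.copy(x).astype(float)  # the "direct" signal
    for delay_s, gain in taps:
        d = int(round(delay_s * sr))  # head position as a sample offset
        if d < len(x):
            y[d:] += gain * x[:len(x) - d]
    return y
```

Real magnetophonic devices add head-dependent filtering, feedback, and tape nonlinearities, which is precisely what a faithful reconstruction such as the one described must model beyond this idealisation.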
Download Towards an Objective Comparison of Panning Feature Algorithms for Unsupervised Learning
Estimates of panning attributes are important features to extract from a piece of recorded music, with downstream uses such as classification, quality assessment, and listening enhancement. While several algorithms exist in the literature, there is currently no comparison between them and no study suggesting which is most suitable for a particular task. This paper compares four algorithms for extracting amplitude panning features with respect to their suitability for unsupervised learning. It finds commonalities between them and analyses their results on a small set of commercial music excerpts chosen for their distinct panning characteristics. The ability of each algorithm to differentiate between the tracks is analysed. The results can be used in future work either to select the most appropriate panning feature algorithm or to create a version customised for a particular task.
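As background (a generic similarity-based formulation, not necessarily one of the four algorithms compared), an amplitude panning feature can be computed per time-frequency bin from the left and right STFT magnitudes:

```python
import numpy as np

def panning_index(mag_l, mag_r, eps=1e-12):
    """Per-bin panning estimate in [-1, 1]:
    -1 = hard left, 0 = centre, +1 = hard right.
    Built from a normalised cross-channel similarity of the
    STFT magnitude spectra of the two stereo channels."""
    l2, r2 = mag_l**2, mag_r**2
    sim = 2.0 * mag_l * mag_r / (l2 + r2 + eps)  # 1 when equal, -> 0 when one-sided
    side = np.sign(r2 - l2)                      # which channel dominates
    return side * (1.0 - sim)
```

Aggregating such per-bin values over time (e.g. as a histogram) yields the kind of track-level panning descriptor that unsupervised learning could cluster on.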
Download Shifted NMF with Group Sparsity for Clustering NMF Basis Functions
Recently, Non-negative Matrix Factorisation (NMF) has found application in the separation of individual sound sources. NMF decomposes the spectrogram of an audio mixture into an additive parts-based representation, where the parts typically correspond to individual notes or chords. However, the NMF basis functions then need to be clustered to their sources. Although many attempts have been made to improve this clustering, much research is still required in this area. Recently, Shifted Non-negative Matrix Factorisation (SNMF) was used to cluster these basis functions. To this end, we propose that incorporating group sparsity into Shifted NMF-based methods may benefit the clustering algorithms. We have tested this on SNMF algorithms and observed improved separation quality. Results show improved clustering of pitched basis functions over previous methods.
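For readers unfamiliar with the underlying decomposition, a minimal sketch of plain NMF with multiplicative updates (the basic factorisation the paper builds on, without the shift-invariance or group sparsity it contributes):

```python
import numpy as np

def nmf(V, rank, n_iter=200, seed=0):
    """Plain NMF with multiplicative updates (Euclidean cost):
    V (freq x time) ~= W (freq x rank) @ H (rank x time),
    where columns of W are spectral basis functions (e.g. notes)
    and rows of H are their time activations."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, rank)) + 1e-3
    H = rng.random((rank, T)) + 1e-3
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + 1e-12)  # update activations
        W *= (V @ H.T) / (W @ H @ H.T + 1e-12)  # update basis functions
    return W, H
```

The clustering problem the abstract discusses arises because nothing in this factorisation ties a column of W to a particular source; that assignment must be made afterwards, which is where SNMF and group sparsity come in.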
Download Real-time Finite Difference Physical Models of Musical Instruments on a Field Programmable Gate Array (FPGA)
Real-time sound synthesis of musical instruments based on solving differential equations is of great interest in musical acoustics, especially for linking geometric features of musical instruments to sound features. A major restriction of accurate physical models is the computational effort: the calculation cost is directly linked to the geometrical and material accuracy of a physical model, and thus to the validity of the results. This work presents a methodology for implementing real-time models of whole instrument geometries, modelled with the Finite Difference Method (FDM), on a Field Programmable Gate Array (FPGA), a device capable of massively parallel computation. Three real-time musical instrument implementations are given as examples: a banjo, a violin, and a Chinese ruan.
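As a toy illustration of the method class (a 1-D ideal string, far simpler than the whole-geometry models in the paper), the explicit finite-difference update is a local stencil, and its per-point independence is what makes FDM map naturally onto massively parallel hardware such as an FPGA:

```python
import numpy as np

def simulate_string(n_points, n_steps, courant=1.0):
    """Explicit FDM for the 1-D wave equation (ideal string, fixed ends):
    u_next = 2u - u_prev + C^2 * laplacian(u).
    Every grid point's update depends only on its neighbours, so all
    points can be updated in parallel each time step."""
    u_prev = np.zeros(n_points)
    u = np.zeros(n_points)
    u[n_points // 2] = 1.0  # pluck-like initial displacement
    c2 = courant**2
    out = []
    for _ in range(n_steps):
        lap = np.zeros(n_points)
        lap[1:-1] = u[2:] - 2.0 * u[1:-1] + u[:-2]  # discrete Laplacian
        u_next = 2.0 * u - u_prev + c2 * lap
        u_next[0] = u_next[-1] = 0.0  # fixed boundary conditions
        u_prev, u = u, u_next
        out.append(u[n_points // 4])  # read a "pickup" position
    return np.array(out)
```

The stability condition C <= 1 (Courant number) bounds the time step relative to the grid spacing; at C = 1 this scheme is exact for the ideal string.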
Download Characterisation of Acoustic Scenes Using a Temporally-constrained Shift-invariant Model
In this paper, we propose a method for modeling and classifying acoustic scenes using temporally-constrained shift-invariant probabilistic latent component analysis (SIPLCA). SIPLCA can be used for extracting time-frequency patches from spectrograms in an unsupervised manner. Component-wise hidden Markov models are incorporated into the SIPLCA formulation to enforce temporal constraints on the activation of each acoustic component. The time-frequency patches are converted to cepstral coefficients in order to provide a compact representation of acoustic events within a scene. Experiments are performed using a corpus of train station recordings classified into six scene classes. Results show that the proposed model is able to capture salient events within a scene and outperforms the non-negative matrix factorization algorithm on the same task. In addition, it is demonstrated that the use of temporal constraints can lead to improved performance.
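As a concrete illustration of one step in the pipeline (the compact-representation stage; the exact variant used in the paper may differ), cepstral coefficients can be obtained as the DCT-II of a log-magnitude spectrum:

```python
import numpy as np

def cepstral_coeffs(mag_spectrum, n_coeffs=13):
    """Compact representation of a spectral slice/patch:
    DCT-II of the log-magnitude spectrum, keeping the first
    n_coeffs coefficients (the smooth spectral envelope)."""
    log_mag = np.log(np.asarray(mag_spectrum, dtype=float) + 1e-12)
    N = len(log_mag)
    n = np.arange(N)
    # DCT-II basis, one row per retained coefficient
    C = np.cos(np.pi * np.outer(np.arange(n_coeffs), (n + 0.5)) / N)
    return C @ log_mag
```

Truncating to the first coefficients keeps the coarse spectral shape of an acoustic event while discarding fine detail, which is what makes the representation compact.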
Download The Tonalness Spectrum: Feature-Based Estimation of Tonal Components
The tonalness spectrum shows the likelihood of a spectral bin being part of a tonal or non-tonal component. It is a non-binary measure based on a set of established spectral features. An easily extensible framework for the computation, selection, and combination of features is introduced. The results are evaluated and compared in two ways: first with a data set of synthetically generated signals, and second with real music signals in the context of a typical MIR application.
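To illustrate the kind of per-bin, non-binary feature such a framework might combine (this particular peak-to-local-median measure is an example of ours, not a feature from the paper):

```python
import numpy as np

def tonalness(mag, med_width=9):
    """Per-bin tonalness candidate in (0, 1): magnitude relative to a
    local spectral median. Narrow peaks (tonal candidates) score near 1;
    noise-like regions, where each bin is close to its local median,
    score near 0.5."""
    half = med_width // 2
    padded = np.pad(mag, half, mode="edge")
    med = np.array([np.median(padded[i:i + med_width])
                    for i in range(len(mag))])
    ratio = mag / (med + 1e-12)
    return ratio / (1.0 + ratio)  # squash to (0, 1)
```

Combining several such features (peakiness, frame-to-frame phase coherence, harmonicity, ...) and mapping the combination to a likelihood is the pattern the framework generalises.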
Download Unsupervised Audio Key and Chord Recognition
This paper presents a new methodology for determining the chords of a music piece without using training data. Specifically, we introduce: 1) a wavelet-based audio denoising component to enhance a chroma-based feature extraction framework, 2) an unsupervised key recognition component to extract a bag of local keys, and 3) a chord recognizer that uses the estimated local keys to adjust the chromagram based on a set of well-known tonal profiles, recognizing chords on a frame-by-frame basis. We aim to recognize five classes of chords (major, minor, diminished, augmented, suspended) plus N (no chord or silence). We demonstrate the performance of the proposed approach on 175 Beatles songs, achieving 75% F-measure for estimating a bag of local keys and at least 68.2% accuracy on chords, without discarding any audio segments or using other musical elements. The experimental results also show that the wavelet-based denoiser improves the chord recognition rate by approximately 4% over that of other chroma features.
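As background on the "well-known tonal profiles" the abstract refers to, a standard template-matching key estimator correlates a 12-bin chroma vector against rotated major/minor profiles; the Krumhansl-Kessler values below are the commonly published ones, and this minimal estimator is an illustration, not the paper's method:

```python
import numpy as np

# Krumhansl-Kessler major/minor key profiles (C-based).
MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                  2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
MINOR = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53,
                  2.54, 4.75, 3.98, 2.69, 3.34, 3.17])

def estimate_key(chroma):
    """Correlate a 12-bin chroma vector against all 24 rotated key
    profiles; return (tonic_pitch_class, 'major' | 'minor')."""
    best, best_key = -np.inf, None
    for tonic in range(12):
        for name, profile in (("major", MAJOR), ("minor", MINOR)):
            r = np.corrcoef(chroma, np.roll(profile, tonic))[0, 1]
            if r > best:
                best, best_key = r, (tonic, name)
    return best_key
```

Applied to short windows, this yields the "bag of local keys" idea: a set of locally estimated keys that can then inform chord recognition.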
Download On the Window-Disjoint Orthogonality of Speech Sources in Reverberant Humanoid Scenarios
Many speech source separation approaches are based on the assumption that speech sources are orthogonal in the time-frequency domain: the target speech source is demixed from the mixture by applying the ideal binary mask. To date, the time-frequency orthogonality of speech sources has been investigated in detail only for anechoic, artificially mixed speech mixtures. This paper evaluates how the orthogonality of speech sources decreases when using a realistic reverberant humanoid recording setup and indicates strategies to enhance the separation capabilities of algorithms based on ideal binary masks under these conditions. It is shown that the SIR of the target source demixed from the mixture using the ideal binary mask decreases by approximately 3 dB for reverberation times of T60 = 0.6 s compared to the anechoic scenario. For humanoid setups, the spatial distribution of the sources and the choice of ear channel introduce differences of a further 3 dB in SIR, which leads to specific strategies for choosing the best channel for demixing.
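The ideal binary mask and the SIR figure of merit it is evaluated with can both be sketched compactly (a time-frequency-domain simplification for illustration; the paper's evaluation works on real recordings):

```python
import numpy as np

def ideal_binary_mask(target_mag, interf_mag):
    """IBM: 1 in time-frequency bins where the target dominates the
    interference, 0 elsewhere. Requires oracle knowledge of both
    sources, hence 'ideal'."""
    return (target_mag > interf_mag).astype(float)

def masked_sir_db(target_mag, interf_mag, eps=1e-12):
    """Signal-to-interference ratio (dB) of the demixed source when
    the IBM is applied to the mixture: retained target energy over
    retained interference energy."""
    mask = ideal_binary_mask(target_mag, interf_mag)
    s = np.sum((mask * target_mag) ** 2)
    i = np.sum((mask * interf_mag) ** 2)
    return 10.0 * np.log10((s + eps) / (i + eps))
```

Reverberation smears each source's energy across more time-frequency bins, so fewer bins are dominated by a single source, the orthogonality assumption weakens, and the achievable SIR drops, as the quantified ~3 dB degradation at T60 = 0.6 s shows.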