Upcycling Android Phones into Embedded Audio Platforms
There are millions of sophisticated Android phones in the world that are disposed of at a very high rate due to consumerism. Their computational power and built-in features could, instead of being wasted when discarded, be repurposed for creative applications such as musical instruments and interactive audio installations. However, audio programming on Android is complicated and comes with restrictions that heavily impact performance. To address this issue, we present LDSP, an open-source environment that can be used to easily upcycle Android phones into embedded platforms optimized for audio synthesis and processing. We conducted a benchmark study comparing the number of oscillators that can be run in parallel on LDSP with an equivalent audio app designed according to modern Android standards. Our study tested six phones released between 2014 and 2018 and running different Android versions. The results consistently demonstrate that LDSP provides a significant performance boost, in some cases more than doubling the oscillator count, making even very old phones suitable for fairly advanced audio applications.
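As an illustration of the kind of benchmark described above, the sketch below times how many sine oscillators a single render loop can sustain within a real-time buffer deadline. It is a minimal, hypothetical reconstruction in Python, not the actual LDSP C++ environment; the sample rate, buffer size, and oscillator design are all assumptions.

```python
import math
import time

SR = 44100             # assumed sample rate
BLOCK = 512            # assumed buffer size in samples
DEADLINE = BLOCK / SR  # real-time budget per buffer, in seconds

def render_block(phases, incs, out):
    """Sum one block of sine oscillators via phase accumulation."""
    for i in range(BLOCK):
        s = 0.0
        for j in range(len(phases)):
            s += math.sin(phases[j])
            phases[j] = (phases[j] + incs[j]) % (2.0 * math.pi)
        out[i] = s / max(len(phases), 1)

def max_oscillators():
    """Increase the oscillator count until a block misses its deadline."""
    n = 1
    while True:
        phases = [0.0] * n
        incs = [2.0 * math.pi * (100.0 + 10.0 * j) / SR for j in range(n)]
        out = [0.0] * BLOCK
        t0 = time.perf_counter()
        render_block(phases, incs, out)
        if time.perf_counter() - t0 > DEADLINE:
            return n - 1
        n += 1

print("sustainable oscillators:", max_oscillators())
```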
Frequency-Dependent Characteristics and Perceptual Validation of the Interaural Thresholded Level Distribution
The interaural thresholded level distribution (ITLD) is a novel metric of auditory source width (ASW), derived from the psychophysical processes and structures of the inner ear. While several of the ITLD’s objective properties have been presented in previous work, its frequency-dependent characteristics and perceptual relationship with ASW have not been previously explored. This paper presents an investigation into these properties of the ITLD, which exhibits pronounced variation in band-limited behaviour as the octave-band centre frequency is increased. Additionally, a very strong correlation was found between [1 – ITLD] and normalised values of ASW collected from a semantic differential listening test based on the Multiple Stimulus with Hidden Reference and Anchor (MUSHRA) framework. Perceptual relationships between various ITLD-derived quantities were also investigated, showing that the low-pass filter intrinsic to the ITLD calculation strengthened the relationship between [1 – ITLD] and ASW. A subsequent test using transient stimuli, as well as investigations into other psychoacoustic properties of the metric such as its just-noticeable difference, are outlined as subjects for future research, to gain a deeper understanding of the subjective properties of the ITLD.
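The reported relationship between [1 – ITLD] and ASW can be checked with an ordinary Pearson correlation. The sketch below assumes per-stimulus ITLD values and normalised ASW ratings are already available; the arrays shown are hypothetical placeholders, not data from the study.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical per-stimulus values; the paper's data are not reproduced here.
itld = np.array([0.82, 0.74, 0.61, 0.55, 0.43])  # ITLD per stimulus
asw = np.array([0.15, 0.28, 0.44, 0.52, 0.67])   # normalised ASW ratings

r, p = pearsonr(1.0 - itld, asw)  # correlate [1 - ITLD] with ASW
print(f"Pearson r = {r:.3f}, p = {p:.3g}")
```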
Assisted Sound Sample Generation with Musical Conditioning in Adversarial Auto-Encoders
Deep generative neural networks have thrived in the field of computer vision, enabling unprecedented intelligent image processing. Yet the results in audio remain less advanced, and many applications are still to be investigated. Our project targets real-time sound synthesis from a reduced set of high-level parameters, including semantic controls that can be adapted to different sound libraries and specific tags. These generative variables should allow expressive modulations of target musical qualities and continuous mixing into new styles. To this end, we train auto-encoders on an orchestral database of individual note samples, along with their intrinsic attributes: note class, timbre domain (an instrument subset), and extended playing techniques. We condition the decoder for explicit control over the rendered note attributes and use latent adversarial training to learn expressive style parameters that can ultimately be mixed. We evaluate both the generative performance and the correlations of the attributes with the latent representation. Our ablation study demonstrates the effectiveness of the musical conditioning. The proposed model generates individual notes as magnitude spectrograms from any probabilistic latent code sample (each latent point maps to a single note), with expressive control of orchestral timbres and playing styles. Its training data subsets can be directly visualized in the 3-dimensional latent representation. Waveform rendering can be done offline with the Griffin-Lim algorithm. To allow real-time interaction, we fine-tune the decoder with a pretrained magnitude spectrogram inversion network and embed the full waveform generation pipeline in a plugin. Moreover, the encoder can be used to process new input samples; after manipulating their latent attribute representation, the decoder can generate sample variations, as an audio effect would. Our solution remains rather lightweight and fast to train, and it can be directly applied to other sound domains, including a user’s libraries with custom sound tags that could be mapped to specific generative controls. As a result, it fosters creativity and intuitive audio style experimentation. Sound examples and additional visualizations are available on GitHub1, as well as code after the review process.
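A decoder conditioned on note attributes, as described above, can be sketched by concatenating the latent code with one-hot attribute vectors before decoding. The sketch below is a generic PyTorch illustration; the layer sizes, attribute dimension, and spectrogram shape are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ConditionalDecoder(nn.Module):
    """Maps a latent code plus attribute one-hots to a magnitude spectrogram."""
    def __init__(self, latent_dim=3, n_attrs=16, spec_bins=513, frames=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + n_attrs, 512),
            nn.ReLU(),
            nn.Linear(512, spec_bins * frames),
            nn.Softplus(),  # magnitudes are non-negative
        )
        self.spec_bins, self.frames = spec_bins, frames

    def forward(self, z, attrs):
        # z: (batch, latent_dim); attrs: (batch, n_attrs) one-hot conditioning
        x = torch.cat([z, attrs], dim=-1)
        return self.net(x).view(-1, self.spec_bins, self.frames)

dec = ConditionalDecoder()
z = torch.randn(1, 3)                        # 3-D latent, as in the abstract
attrs = torch.zeros(1, 16); attrs[0, 2] = 1  # hypothetical attribute one-hot
spec = dec(z, attrs)                         # (1, 513, 64) magnitude spectrogram
```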
Differentiable All-Pole Filters for Time-Varying Audio Systems
Infinite impulse response filters are an essential building block of many time-varying audio systems, such as audio effects and synthesisers. However, their recursive structure impedes end-to-end training of these systems using automatic differentiation. Although non-recursive filter approximations like frequency sampling and frame-based processing have been proposed and widely used in previous works, they cannot accurately reflect the gradient of the original system. We alleviate this difficulty by re-expressing a time-varying all-pole filter so that gradients can be backpropagated through the filter itself, meaning the filter implementation is not bound to the technical limitations of automatic differentiation frameworks. This implementation can be employed within audio systems containing filters with poles for efficient gradient evaluation. We demonstrate its training efficiency and expressive capabilities for modelling real-world dynamic audio systems on a phaser, a time-varying subtractive synthesiser, and a feed-forward compressor. We make our code and audio samples available and provide the trained audio effect and synth models in a VST plugin1.
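The idea of backpropagating through the recursion itself, rather than through an unrolled autodiff graph, can be sketched with a custom autograd function. For the time-varying all-pole filter y[n] = x[n] − Σ_k a_k[n] y[n−k], the adjoint recursion runs backwards in time. This is a minimal, unoptimized Python sketch of that structure, not the authors' implementation.

```python
import torch

class TimeVaryingAllPole(torch.autograd.Function):
    """y[n] = x[n] - sum_k a[n, k-1] * y[n-k], with a hand-written backward."""

    @staticmethod
    def forward(ctx, x, a):
        T, M = a.shape
        y = torch.zeros_like(x)
        for n in range(T):
            acc = x[n]
            for k in range(1, min(M, n) + 1):
                acc = acc - a[n, k - 1] * y[n - k]
            y[n] = acc
        ctx.save_for_backward(a, y)
        return y

    @staticmethod
    def backward(ctx, grad_y):
        a, y = ctx.saved_tensors
        T, M = a.shape
        # Adjoint state w runs backwards through the same recursion.
        w = torch.zeros_like(grad_y)
        for n in range(T - 1, -1, -1):
            acc = grad_y[n]
            for k in range(1, min(M, T - 1 - n) + 1):
                acc = acc - a[n + k, k - 1] * w[n + k]
            w[n] = acc
        grad_x = w
        grad_a = torch.zeros_like(a)
        for n in range(T):
            for k in range(1, min(M, n) + 1):
                grad_a[n, k - 1] = -w[n] * y[n - k]
        return grad_x, grad_a

# Gradient check on a short signal with 2nd-order time-varying coefficients.
x = torch.randn(32, dtype=torch.double, requires_grad=True)
a = 0.1 * torch.randn(32, 2, dtype=torch.double, requires_grad=True)
torch.autograd.gradcheck(TimeVaryingAllPole.apply, (x, a))
```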
A Diffusion-Based Generative Equalizer for Music Restoration
This paper presents a novel approach to audio restoration, focusing on the enhancement of low-quality music recordings, and in particular historical ones. Building upon a previous algorithm called BABE, or Blind Audio Bandwidth Extension, we introduce BABE-2, which presents a series of improvements. This research broadens the concept of bandwidth extension to generative equalization, a task that, to the best of our knowledge, has not been previously addressed for music restoration. BABE-2 is built around an optimization algorithm utilizing priors from diffusion models, which are trained or fine-tuned using a curated set of high-quality music tracks. The algorithm simultaneously performs two critical tasks: estimation of the filter degradation magnitude response and hallucination of the restored audio. The proposed method is objectively evaluated on historical piano recordings, showing an enhancement over the prior version. The method yields similarly impressive results in rejuvenating the works of renowned vocalists Enrico Caruso and Nellie Melba. This research represents an advancement in the practical restoration of historical music. Historical music restoration examples are available at: research.spa.aalto.fi/publications/papers/dafx-babe2/.
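At a schematic level, the optimization described above alternates between estimating the degradation filter's magnitude response and using the diffusion prior to hallucinate the restored audio. The loop below is purely illustrative, with a stand-in denoiser and a one-tap-per-bin filter model; none of the function internals reflect BABE-2's actual components.

```python
import numpy as np

def denoiser(x_t, t):
    """Stand-in for a trained diffusion denoiser (hypothetical placeholder)."""
    return x_t * (1.0 - t)  # NOT a real model; illustrates the interface only

def apply_filter(x, w):
    """Stand-in degradation model: per-bin spectral magnitude weighting."""
    return np.fft.irfft(np.fft.rfft(x) * w, n=len(x))

def restore(y, n_steps=50, lr=1e-2):
    """Alternate diffusion-guided restoration and filter estimation."""
    x = np.random.randn(len(y))       # start from noise
    w = np.ones(len(y) // 2 + 1)      # flat initial filter guess
    for i in range(n_steps):
        t = 1.0 - i / n_steps
        x = denoiser(x, t)                 # prior step: hallucinate audio
        resid = apply_filter(x, w) - y     # data-consistency error
        x -= lr * apply_filter(resid, w)   # pull x toward the observation
        # Filter step: least-squares-style magnitude update per bin.
        X, Y = np.fft.rfft(x), np.fft.rfft(y)
        w = np.abs(Y) / (np.abs(X) + 1e-8)
    return x, w
```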
Design of FPGA-based High-order FDTD Method for Room Acoustics
Sound field rendering with the finite difference time domain (FDTD) method is computation- and memory-intensive. This research investigates an FPGA-based acceleration system for sound field rendering with the high-order FDTD method, in which spatial and temporal blocking are applied to alleviate the external memory bandwidth bottleneck and to reuse data, respectively. Implemented on the DE10-Pro FPGA card, the FPGA-based sound field rendering systems outperform software simulations conducted on a desktop machine with 512 GB of DRAM and a Xeon Gold 6212U processor (24 cores) running at 2.4 GHz by 11, 13, and 18 times in computing performance for the 2nd-order, 4th-order, and 6th-order FDTD schemes, respectively, even though the FPGA-based systems run at a much lower clock frequency and have much smaller on-chip and external memory.
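For reference, a single time step of the 2nd-order scheme mentioned above is the familiar leapfrog update p_next = 2p − p_prev + (cΔt/Δx)²∇²p. The sketch below is a plain NumPy version of that 3-D update with rigid boundaries, illustrating the stencil the FPGA pipeline accelerates; the grid size and constants are arbitrary assumptions.

```python
import numpy as np

c, dx = 343.0, 0.05           # speed of sound (m/s), grid spacing (m)
dt = dx / (c * np.sqrt(3.0))  # CFL-stable time step for a 3-D grid
lam2 = (c * dt / dx) ** 2     # Courant number squared

shape = (64, 64, 64)          # assumed room grid
p_prev = np.zeros(shape)
p = np.zeros(shape)
p[32, 32, 32] = 1.0           # impulse excitation

def step(p, p_prev):
    """One leapfrog update of the 2nd-order FDTD scheme (interior points)."""
    lap = (
        p[:-2, 1:-1, 1:-1] + p[2:, 1:-1, 1:-1] +
        p[1:-1, :-2, 1:-1] + p[1:-1, 2:, 1:-1] +
        p[1:-1, 1:-1, :-2] + p[1:-1, 1:-1, 2:] -
        6.0 * p[1:-1, 1:-1, 1:-1]
    )
    p_next = p.copy()
    p_next[1:-1, 1:-1, 1:-1] = (
        2.0 * p[1:-1, 1:-1, 1:-1] - p_prev[1:-1, 1:-1, 1:-1] + lam2 * lap
    )
    return p_next, p

for _ in range(100):
    p, p_prev = step(p, p_prev)
```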
Audio De-Thumping Using Huang's Empirical Mode Decomposition
In the context of audio restoration, sound transfer from broken disks usually produces audio signals corrupted with long pulses of low-frequency content, also called thumps. This paper presents a method for audio de-thumping based on Huang's Empirical Mode Decomposition (EMD), provided the pulse locations are known beforehand. The EMD is used as a means to obtain pulse estimates, which are then subtracted from the degraded signals. Despite its simplicity, the method is demonstrated to tackle well the challenging problem of superimposed pulses. Performance assessment against selected competing solutions reveals that the proposed solution tends to produce superior de-thumping results.
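Since the abstract states that EMD yields pulse estimates that are subtracted at known locations, a minimal version of that idea can be sketched as follows. It assumes the PyEMD package and treats the sum of the slowest intrinsic mode functions (including the residue, which PyEMD appends as the last row) as the thump estimate; the segment handling and number of retained modes are assumptions, not the paper's settings.

```python
import numpy as np
from PyEMD import EMD  # assumes the PyEMD package is installed

def dethump(x, pulse_start, pulse_len, n_slow_imfs=2):
    """Estimate a low-frequency thump around a known location and subtract it."""
    seg = x[pulse_start:pulse_start + pulse_len].astype(float)
    imfs = EMD().emd(seg)  # rows: fastest IMFs first, slowest/residue last
    # Sum the slowest modes as the pulse estimate (assumed heuristic).
    pulse_est = imfs[-n_slow_imfs:].sum(axis=0)
    out = x.copy().astype(float)
    out[pulse_start:pulse_start + pulse_len] -= pulse_est
    return out
```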
Analysis and Synthesis of the Violin Playing Style of Heifetz and Oistrakh
The same music composition can be performed in different ways, and differences in performance aspects can strongly change the expression and character of the music. Experienced musicians tend to have their own performance style, which reflects their personality, attitudes, and beliefs. In this paper, we present a data-driven analysis of the performance styles of two master violinists, Jascha Heifetz and David Fyodorovich Oistrakh, to identify their differences. Specifically, from 26 gramophone recordings of each of the two violinists, we compute features characterizing performance aspects including articulation, energy, and vibrato, and then compare their styles in terms of the accents and legato groups of the music. Based on our findings, we propose algorithms to synthesize solo violin recordings of these two masters from scores, for music compositions that we either have or have not observed in the analysis stage. To the best of our knowledge, this study represents the first attempt to computationally analyze and synthesize the playing style of master violinists.
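One of the vibrato features mentioned above, vibrato rate, can be estimated from an f0 trajectory by locating the dominant modulation frequency of the detrended contour. The sketch below is a generic illustration, not the authors' feature extractor; the frame rate, search band, and synthetic f0 contour are assumptions.

```python
import numpy as np

FRAME_RATE = 100.0  # assumed f0 frames per second

def vibrato_rate(f0):
    """Dominant modulation frequency (Hz) of a detrended f0 contour."""
    n = np.arange(len(f0))
    contour = f0 - np.polyval(np.polyfit(n, f0, 1), n)  # remove linear trend
    spec = np.abs(np.fft.rfft(contour * np.hanning(len(contour))))
    freqs = np.fft.rfftfreq(len(contour), d=1.0 / FRAME_RATE)
    band = (freqs >= 3.0) & (freqs <= 10.0)  # typical vibrato range
    return freqs[band][np.argmax(spec[band])]

# Synthetic check: 440 Hz tone with 6 Hz vibrato of 5 Hz depth.
t = np.arange(200) / FRAME_RATE
f0 = 440.0 + 5.0 * np.sin(2 * np.pi * 6.0 * t)
print(vibrato_rate(f0))  # ~6.0
```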
Combining classifications based on local and global features: application to singer identification
In this paper, we investigate the problem of singer identification on a cappella recordings of isolated notes. Most studies on singer identification describe the content of singing-voice signals with features related to timbre (such as MFCCs or LPC). These features describe the behavior of frequencies at a given instant of time (local features). In this paper, we propose to describe a sung tone by the temporal variations of the fundamental frequency (and its harmonics) of the note. The periodic and continuous variations of the frequency trajectories are analyzed over the whole note, and the resulting features reflect expressive and intonative elements of singing such as vibrato, tremolo, and portamento. The experiments, conducted on two distinct data sets (lyric and pop-rock singers), show that the new set of features captures part of the singer's identity. However, these features are less accurate than timbre-based features. We propose to increase the recognition rate of singer identification by combining the information conveyed by the local and global descriptions of notes. The proposed method, which shows good results, can be adapted to classification problems involving a large number of classes, or to combine classifiers with different levels of performance.
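The combination of local (timbre) and global (intonation) classifiers described above can be realized with simple late fusion of class posteriors. The sketch below shows a weighted sum, with the weight favouring the stronger timbre-based classifier; the weight value and the posterior arrays are illustrative assumptions, not the paper's fusion rule.

```python
import numpy as np

def fuse(p_local, p_global, alpha=0.7):
    """Late fusion of two posterior vectors; alpha weights the local classifier."""
    p = alpha * p_local + (1.0 - alpha) * p_global
    return p / p.sum()  # renormalize to a distribution

# Hypothetical posteriors over three singers from the two classifiers.
p_local = np.array([0.5, 0.3, 0.2])   # timbre-based (local) classifier
p_global = np.array([0.2, 0.6, 0.2])  # f0-trajectory (global) classifier
print(fuse(p_local, p_global))        # fused decision: argmax over singers
```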
First-Order Ambisonic Coding with PCA Matrixing and Quaternion-Based Interpolation
We present a spatial audio coding method which can extend existing speech/audio codecs, such as EVS or Opus, to represent first-order ambisonic (FOA) signals at low bit rates. The proposed method is based on principal component analysis (PCA) to decorrelate ambisonic components prior to multi-mono coding. The PCA rotation matrices are quantized in the generalized Euler angle domain; they are interpolated in the quaternion domain to avoid discontinuities between successive signal blocks. We also describe an adaptive bit allocation algorithm for an optimized multi-mono coding of principal components. A subjective evaluation using the MUSHRA methodology is presented to compare the performance of the proposed method with naive multi-mono coding using a fixed bit allocation. Results show significant quality improvements at bit rates in the range of 52.8 kbit/s (4 × 13.2) to 97.6 kbit/s (4 × 24.4) using the EVS codec.
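The PCA step described above can be sketched per signal block: estimate the 4×4 covariance of the FOA channels, eigendecompose it, and rotate the block so its components are decorrelated before multi-mono coding. The sketch below covers only this decorrelation; the Euler-angle quantization and quaternion-domain interpolation of the rotation matrices are omitted, and the block length is an assumption.

```python
import numpy as np

BLOCK = 960  # assumed block length (e.g., 20 ms at 48 kHz)

def pca_decorrelate(block):
    """Rotate a (4, BLOCK) FOA block into its principal components."""
    cov = block @ block.T / block.shape[1]  # 4x4 channel covariance
    eigvals, V = np.linalg.eigh(cov)        # eigenvalues in ascending order
    V = V[:, ::-1]                          # strongest component first
    if np.linalg.det(V) < 0:                # keep V a proper rotation
        V[:, -1] *= -1.0
    return V.T @ block, V                   # principal components, matrix

foa = np.random.randn(4, BLOCK)  # stand-in FOA block (W, X, Y, Z channels)
pcs, V = pca_decorrelate(foa)    # pcs go to 4x mono coding
reconstructed = V @ pcs          # decoder applies the inverse rotation
```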