Increasing Drum Transcription Vocabulary Using Data Synthesis
Current datasets for automatic drum transcription (ADT) are small and limited due to the tedious task of annotating onset events. While some of these datasets contain large vocabularies of percussive instrument classes (e.g. ~20 classes), many of these classes occur very infrequently in the data. This paucity of data makes it difficult to train models that support such large vocabularies. Therefore, data-driven drum transcription models often focus on a small number of percussive instrument classes (e.g. 3 classes). In this paper, we propose to support large-vocabulary drum transcription by generating a large synthetic dataset (210,000 eight-second examples) of audio examples for which we have ground-truth transcriptions. Using this synthetic dataset along with existing drum transcription datasets, we train convolutional-recurrent neural networks (CRNNs) in a multi-task framework to support large-vocabulary ADT. We find that training on both the synthetic and real music drum transcription datasets together improves performance not only on large-vocabulary ADT, but also on beat/downbeat detection and small-vocabulary ADT.
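As a toy illustration of why synthesized audio comes with ground-truth transcriptions for free, the sketch below places hits at known onset times and returns those times as labels. Only the eight-second example length is taken from the abstract; the function name `synth_example`, the three class names, the sample rate, and the decaying-sine "drum" timbres are all assumptions made for illustration and are not the synthesis procedure used in the paper.

```python
import numpy as np

def synth_example(rng, fs=16000, duration=8.0,
                  classes=("kick", "snare", "hihat"), hits_per_class=8):
    """Render one synthetic clip and return (audio, labels).

    labels is a time-sorted list of (onset_seconds, class_name) pairs.
    """
    n = int(fs * duration)
    audio = np.zeros(n)
    labels = []
    hit_len = int(0.1 * fs)                  # every hit lasts 100 ms
    t = np.arange(hit_len) / fs
    for idx, name in enumerate(classes):
        # Placeholder timbre (assumption): a decaying sine, one pitch per class.
        freq = 60.0 * (idx + 1) ** 2
        hit = np.sin(2 * np.pi * freq * t) * np.exp(-t / 0.02)
        # Draw onset times so that every hit fits inside the clip.
        onsets = rng.uniform(0.0, duration - 0.1, size=hits_per_class)
        for onset in onsets:
            start = int(onset * fs)
            audio[start:start + hit_len] += hit
            # The label is known exactly because we chose the onset ourselves.
            labels.append((float(onset), name))
    labels.sort()
    return audio, labels

audio, labels = synth_example(np.random.default_rng(0))
```

Because the onsets are chosen before rendering, no manual annotation step is needed, which is what makes it practical to cover rarely occurring classes at scale.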
Automatic drum transcription with convolutional neural networks
Automatic drum transcription (ADT) aims to detect drum events in polyphonic music. This task is part of the more general problem of transcribing a music signal in terms of its musical score, and it can also be useful for extracting high-level information such as tempo, downbeat, and measure. This article investigates the use of Convolutional Neural Networks (CNN) in the context of ADT. Two different strategies are compared. First, a CNN-based detection of drum-only onsets is combined with an algorithm using Non-negative Matrix Deconvolution (NMD) for drum onset transcription. Then an approach relying entirely on CNN for the detection of individual drum instruments is described. The question of which loss function is best suited to this task is investigated together with the question of the optimal input structure. All algorithms are evaluated using the publicly available ENST Drum database, a widely used reference dataset, allowing easy comparison with other algorithms. The comparison shows that the purely CNN-based algorithm significantly outperforms the NMD-based approach. Compared with the best results published so far, including ones that also use a neural network model, its results are significantly better for the snare drum but slightly worse for both the bass drum and the hi-hat.
Optimized Velvet-Noise Decorrelator
Decorrelation of audio signals is a critical step for spatial sound reproduction on multichannel configurations. Correlated signals yield a focused phantom source between the reproduction loudspeakers and may produce undesirable comb-filtering artifacts when the signal reaches the listener with small phase differences. Decorrelation techniques reduce such artifacts and extend the spatial auditory image by randomizing the phase of a signal while minimizing the spectral coloration. This paper proposes a method to optimize the decorrelation properties of a sparse noise sequence, called velvet noise, to generate short sparse FIR decorrelation filters. The sparsity allows a highly efficient time-domain convolution. The listening test results demonstrate that the proposed optimization method can yield effective and colorless decorrelation filters. In comparison to a white noise sequence, the filters obtained using the proposed method better preserve the spectrum of a signal and produce good-quality broadband decorrelation while using 76% fewer operations for the convolution. Satisfactory results can be achieved with an even lower impulse density, which decreases the computational cost by 88%.
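To make the efficiency argument concrete, here is a minimal sketch of a basic velvet-noise sequence (one randomly placed ±1 impulse per grid period) and of a time-domain convolution that exploits its sparsity. It does not include the optimization step proposed in the paper; the function names `velvet_noise` and `sparse_convolve` and the parameter values are assumptions chosen for illustration.

```python
import numpy as np

def velvet_noise(length, fs, density, rng):
    """Return (positions, signs) of a velvet-noise sequence.

    length  : sequence length in samples
    density : average number of impulses per second
    """
    grid = fs / density                       # average impulse spacing in samples
    n_pulses = int(length / grid)
    m = np.arange(n_pulses)
    # One impulse per grid period, jittered uniformly inside that period.
    positions = np.round(m * grid + rng.random(n_pulses) * (grid - 1)).astype(int)
    positions = np.minimum(positions, length - 1)
    signs = rng.choice([-1.0, 1.0], size=n_pulses)
    return positions, signs

def sparse_convolve(x, positions, signs):
    """Convolve x with the sparse +/-1 filter, one shifted add per impulse."""
    filt_len = int(positions.max()) + 1
    y = np.zeros(len(x) + filt_len - 1)
    for k, s in zip(positions, signs):
        y[k:k + len(x)] += s * x              # no multiplications by zero taps
    return y

rng = np.random.default_rng(0)
positions, signs = velvet_noise(length=1024, fs=44100, density=2000, rng=rng)
x = rng.standard_normal(256)
y = sparse_convolve(x, positions, signs)
```

The cost of `sparse_convolve` scales with the number of impulses rather than with the filter length, which is why lowering the impulse density reduces the operation count.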
Surround Sound without Rear Loudspeakers: Multichannel Compensated Amplitude Panning and Ambisonics
Conventional panning approaches for surround sound require loudspeakers to be distributed over the regions where images are needed. However, in many listening situations it is not practical or desirable to place loudspeakers in some positions, such as behind or above the listener. Compensated Amplitude Panning (CAP) is a method that adapts dynamically to the listener’s head orientation to provide images in any direction, in the frequency range up to ≈ 1000 Hz, using only 2 loudspeakers. CAP is extended here for more loudspeakers, which removes some limitations and provides additional benefits. The new CAP method is also compared with an Ambisonics approach that is adapted for surround sound without rear loudspeakers.
A Feedback Canceling Reverberator
A real-time auralization system is described in which room sounds are reverberated and presented over loudspeakers. Room microphones are used to capture room sound sources, with their outputs processed in a canceler to remove the synthetic reverberation also present in the room. Doing so suppresses feedback and gives precise control over the auralization. It also allows freedom of movement and creates a more dynamic acoustic environment for performers or participants in music, theater, gaming, and virtual reality applications. Canceler design methods are discussed, including techniques for handling varying loudspeaker-microphone transfer functions such as would be present in the context of a performance or installation. Tests in a listening room and recital hall show in excess of 20 dB of feedback suppression.
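The canceling idea can be sketched in its simplest form: if the loudspeaker signal and an estimate of the loudspeaker-to-microphone impulse response are known, the predicted loudspeaker contribution is subtracted from the microphone signal. This is only an illustration under those assumptions; the names `cancel` and `suppression_db`, the random test signals, and the 1% gain error in the estimate are hypothetical and are not the canceler design described in the paper.

```python
import numpy as np

def cancel(mic, speaker, h_est):
    """Subtract the predicted loudspeaker contribution from the mic signal."""
    predicted = np.convolve(speaker, h_est)[:len(mic)]
    return mic - predicted

def suppression_db(mic, canceled, dry):
    """Feedback energy before vs. after cancellation, in dB."""
    before = np.sum((mic - dry) ** 2)
    after = np.sum((canceled - dry) ** 2)
    return 10.0 * np.log10(before / after)

rng = np.random.default_rng(0)
dry = rng.standard_normal(4000)              # room sound source at the microphone
speaker = rng.standard_normal(4000)          # synthetic reverberation sent to the loudspeaker
h_true = rng.standard_normal(64) * np.exp(-np.arange(64) / 10.0)
mic = dry + np.convolve(speaker, h_true)[:4000]

h_est = 0.99 * h_true                        # assumed estimate with a 1% gain error
canceled = cancel(mic, speaker, h_est)
print(round(suppression_db(mic, canceled, dry), 1))
```

In this toy setup the suppression depends only on how well `h_est` matches the true transfer function, which is why time-varying loudspeaker-microphone paths need dedicated handling.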
Efficient signal extrapolation by granulation and convolution with velvet noise
Several methods are available nowadays to artificially extend the duration of a signal for audio restoration or creative music production purposes. The most common approaches include overlap-and-add (OLA) techniques, FFT-based methods, and linear predictive coding (LPC). In this work we describe a novel OLA algorithm based on convolution with velvet noise, in order to exploit its sparsity and spectral flatness. The proposed method suppresses spectral coloration and achieves remarkable computational efficiency. Its issues are addressed and some design choices are explored. Experimental results are presented and compared to a well-known FFT-based method.
Improving intelligibility prediction under informational masking using an auditory saliency model
The reduction of speech intelligibility in noise is usually dominated by energetic masking (EM) and informational masking (IM). Most state-of-the-art objective intelligibility measures (OIM) estimate intelligibility by quantifying EM. Few measures model the effect of IM in detail. In this study, an auditory saliency model, which intends to measure the probability of the sources obtaining auditory attention in a bottom-up process, was integrated into an OIM to improve the performance of intelligibility prediction under IM. While EM is accounted for by the original OIM, IM is assumed to arise from the listener’s attention switching between the target and competing sounds existing in the auditory scene. The performance of the proposed method was evaluated along with three reference OIMs by comparing the model predictions to the listener word recognition rates, for different noise maskers, some of which introduce IM. The results show that the predictive accuracy of the proposed method is as good as the best reported in the literature. The proposed method, however, provides a physiologically plausible possibility for both IM and EM modelling.
Acoustic Assessment Of A Classroom And Rehabilitation Guided By Simulation
The acoustics of spaces intended for speech communication, such as classrooms, has not been given due importance in architectural projects. The resulting adverse acoustic conditions affect the learning of students and the well-being of teachers on a daily basis. One of the lecture rooms of the Faculty of Engineering of the University of Porto (FEUP) was chosen as a representative case, and its acoustic conditions were evaluated and compared with those known to be necessary for speech communication. Several measurements were made in the space to determine whether the acoustic parameters fall within the appropriate ranges. An acoustic model of the amphitheater under study was developed in the EASE software, making it possible to obtain simulated results for comparison with the previously measured parameters and to introduce changes in the model to assess their impact on the real space. In this phase, the auralization resources of the software were used to convey how sound is heard in the modelled room. This was useful for the rehabilitation phase because the improvement in speech intelligibility could be judged subjectively. Finally, possible solutions are presented, both acoustic treatments and electroacoustic sound reinforcement, aiming to provide better acoustic comfort and more effective communication for the people who use the room.
Using Semantic Differential Scales To Assess The Subjective Perception Of Auditory Warning Signals
The relationship between physical acoustic parameters and the subjective responses they evoke is important to assess in audio alarm design. While the perception of urgency has been thoroughly investigated, the perception of other variables such as pleasantness, negativeness and irritability has not. To characterize the psychological correlates of variables such as frequency, speed, rhythm and onset, twenty-six participants evaluated fifty-four audio warning signals according to six different semantic differential scales. Regression analysis showed that speed predicted mostly the perception of urgency, preoccupation and negativity; frequency predicted the perception of pleasantness and irritability; and rhythm affected the perception of urgency. No correlation was found with onset and offset times. These findings are important to human-centred design recommendations for auditory warning signals.
Soundscape auralisation and visualisation: A cross-modal approach to Soundscape evaluation
Soundscape research is concerned with the study and understanding of our relationship with our surrounding acoustic environments and the sonic elements that they are comprised of. Whilst much of this research has focussed on sound alone, any practical application of soundscape methodologies should consider the interaction between aural and visual environmental features: an interaction known as cross-modal perception. This presents an avenue for soundscape research exploring how an environment’s visual features can affect an individual’s experience of the soundscape of that same environment. This paper presents the results of two listening tests: one a preliminary test making use of static stereo UHJ renderings of first-order-ambisonic (FOA) soundscape recordings and static panoramic images; the other using YouTube as a platform to present dynamic binaural renderings of the same FOA recordings alongside full motion spherical video. The stimuli for these tests were recorded at several locations around the north of England including rural, urban, and suburban environments exhibiting soundscapes comprised of many natural, human, and mechanical sounds. The purpose of these tests was to investigate how the presence of visual stimuli can alter soundscape perception and categorisation. This was done by presenting test subjects with each soundscape alone and then with visual accompaniment, and then comparing collected subjective evaluation data. Results indicate that the presence of certain visual features can alter the emotional state evoked by exposure to a soundscape, for example, where the presence of ‘green infrastructure’ (parks, trees, and foliage) results in a less agitating experience of a soundscape containing high levels of environmental noise. This research represents an important initial step toward the integration of virtual reality technologies into soundscape research, and the use of suitable tools to perform subjective evaluation of audiovisual stimuli.
Future research will consider how these methodologies can be implemented in real-world applications.