Fusing Block-level Features for Music Similarity Estimation
In this paper we present a novel approach to computing music similarity based on block-level features. We first introduce three novel block-level features: the Variance Delta Spectral Pattern (VDSP), the Correlation Pattern (CP), and the Spectral Contrast Pattern (SCP). Then we describe how to combine the extracted features into a single similarity function. A comprehensive evaluation based on genre classification experiments shows that the combined block-level similarity measure (BLS) is comparable, in terms of quality, to the best current method from the literature. But BLS has the important advantage of being based on a vector space representation, which directly facilitates a number of useful operations, such as PCA, k-means clustering, and visualization. We also show that there is still potential for further improvement of music similarity measures by combining BLS with another state-of-the-art algorithm; the combined algorithm then outperforms all other algorithms in our evaluation. Additionally, we discuss the problem of album and artist effects in the context of similarity-based recommendation and show that one can detect the presence of such effects in a given dataset by analyzing the nearest-neighbor classification results.
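Because BLS operates in a vector space, fusing several block-level features reduces to combining per-feature distances. The following is a minimal sketch of such a fusion, assuming plain Euclidean distances and a simple distance-normalization step; the paper's actual features (VDSP, CP, SCP) and its specific combination rule are not reproduced here.

```python
import numpy as np

def fused_distance_matrix(feature_sets, weights):
    """feature_sets: list of (n_tracks, dim_i) arrays, one per block-level feature."""
    total = None
    for feats, w in zip(feature_sets, weights):
        diff = feats[:, None, :] - feats[None, :, :]
        d = np.sqrt((diff ** 2).sum(axis=-1))    # pairwise Euclidean distances
        d = (d - d.mean()) / (d.std() + 1e-12)   # normalize so no feature dominates
        total = w * d if total is None else total + w * d
    return total

# toy usage: 5 tracks, three feature spaces of different dimensionality
rng = np.random.default_rng(0)
feats = [rng.normal(size=(5, k)) for k in (40, 25, 60)]
D = fused_distance_matrix(feats, weights=[1.0, 1.0, 1.0])
print(np.argsort(D, axis=1)[:, 1])  # nearest neighbour of each track
```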
An efficient audio time-scale modification algorithm for use in a subband implementation
The PAOLA algorithm is an efficient algorithm for the time-scale modification of speech. It uses a simple peak alignment technique to synchronise synthesis frames and takes waveform properties and the desired time-scale factor into account to determine optimum algorithm parameters. However, PAOLA has difficulties with certain waveform types and can result in poor synchronisation for subband implementations. SOLA is a less efficient algorithm but resolves the issues associated with PAOLA's implementation. We present an algorithm that combines the two approaches and proves both efficient and effective for a subband implementation.
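For context, the core of a SOLA-style method is the correlation-based frame alignment referenced above: each synthesis frame is shifted within a small search window to maximize its correlation with the already-synthesized output before overlap-add. The sketch below illustrates that idea only, with assumed parameter values; it is not the paper's combined subband algorithm.

```python
import numpy as np

def sola(x, alpha, frame=1024, overlap=256, search=128):
    hop_a = frame - overlap                    # analysis hop
    hop_s = int(round(hop_a * alpha))          # synthesis hop; alpha = stretch factor
    y = x[:frame].astype(float).copy()
    pos_a, pos_s = hop_a, hop_s
    while pos_a + frame <= len(x):
        seg = x[pos_a:pos_a + frame].astype(float)
        best_k, best_c = 0, -np.inf
        for k in range(-search, search + 1):   # search for the best overlap position
            p = pos_s + k
            if p < 0 or p + overlap > len(y):
                continue
            c = np.dot(y[p:p + overlap], seg[:overlap])
            if c > best_c:
                best_c, best_k = c, k
        # fall back to the end of the output if no lag in the window was valid
        p = pos_s + best_k if best_c > -np.inf else len(y) - overlap
        fade = np.linspace(0.0, 1.0, overlap)  # linear cross-fade in the overlap
        y[p:p + overlap] = y[p:p + overlap] * (1 - fade) + seg[:overlap] * fade
        y = np.concatenate([y[:p + overlap], seg[overlap:]])
        pos_a += hop_a
        pos_s = p + hop_s
    return y
```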
Effective Separation of Low-Pitch Notes Using NMF Using Non-Power-of-2 Discrete Fourier Transforms
Non-negative matrix factorization (NMF), applied to decompose signals in the frequency domain by means of the short-time Fourier transform (STFT), is widely used in audio source separation. Separation of low-pitch notes in recordings is of significant interest. According to the time-frequency uncertainty principle, low-pitch sounds suffer from the trade-off between time and frequency localization. Furthermore, because the window function applied to the signal causes frequency spreading, separation of low-pitch notes becomes even more difficult. Instead of using a power-of-2 FFT, we experiment with STFT sizes corresponding to the pitches of the notes in the signals. Computer simulations using synthetic signals show that the Source to Interferences Ratio (SIR) is significantly improved without sacrificing the Sources to Artifacts Ratio (SAR) and Source to Distortion Ratio (SDR). On average, an improvement of 2 to 6 dB in SIR is achieved when compared to a power-of-2 FFT of similar size.
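A small worked example of the idea: choose the DFT length so that the fundamental of a low-pitch note falls exactly on a bin centre, instead of forcing a power-of-2 length. The note and frame length below are assumptions for illustration.

```python
fs = 44100.0
f0 = 55.0                        # A1, a low-pitch note
period = fs / f0                 # ~801.8 samples per cycle
N = round(8 * period)            # ~8 cycles per frame -> N = 6415, not a power of 2
print(N, fs / N, f0 / (fs / N))  # frame size, bin width, (near-integer) bin index
# A power-of-2 frame of similar size (N = 8192) has a 5.38 Hz bin width and
# puts 55 Hz between bins 10 and 11, spreading its energy across bins.
```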
Inferring the hand configuration from hand clapping sounds
In this paper, a technique for inferring the configuration of a clapper’s hands from a hand clapping sound is described. The method was developed based on analysis of synthetic and recorded hand clap sounds, labeled with the corresponding hand configurations. A naïve Bayes classifier was constructed to automatically classify the data using two different feature sets. The results indicate that the approach is applicable for inferring the hand configuration.
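A minimal sketch of the classification setup described above: a naive Bayes classifier mapping clap-sound features to hand-configuration labels. The feature values and label names here are made up for illustration; the paper's two feature sets are not reproduced.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(1)
# pretend features, e.g. spectral centroid (Hz) and decay time (s) per clap
X = np.vstack([rng.normal([1800, 0.05], [200, 0.01], size=(30, 2)),
               rng.normal([1200, 0.08], [200, 0.01], size=(30, 2))])
y = np.array(["flat-palms"] * 30 + ["cupped-hands"] * 30)

clf = GaussianNB().fit(X, y)
print(clf.predict([[1750, 0.05], [1150, 0.09]]))
```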
Audio-visual Multiple Active Speaker Localization in Reverberant Environments
Localisation of multiple active speakers in natural environments with only two microphones is a challenging problem. Reverberation degrades the performance of speaker localisation based exclusively on directional cues. This paper presents an approach based on audio-visual fusion. The audio modality performs the multiple speaker localisation using the Skeleton method, energy weighting, and precedence effect filtering and weighting. The video modality performs the active speaker detection based on the analysis of the lip region of the detected speakers. The audio modality alone has problems with localisation accuracy, while the video modality alone has problems with false detections. The estimation results of both modalities are represented as probabilities in the azimuth domain. A Gaussian fusion method is proposed to combine the estimates in a late stage. As a consequence, localisation accuracy and robustness are significantly increased compared to either modality alone. Experimental results in different scenarios confirm the improved performance of the proposed method.
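A sketch of the late Gaussian fusion idea: audio and video each yield a probability over azimuth, and modelling each as a Gaussian, their product is again Gaussian with a tighter variance. The numbers below are illustrative only, not the paper's data.

```python
def fuse_gaussians(mu_a, var_a, mu_v, var_v):
    # product of two Gaussian densities (up to normalization)
    var = 1.0 / (1.0 / var_a + 1.0 / var_v)
    mu = var * (mu_a / var_a + mu_v / var_v)
    return mu, var

# audio says ~20 deg but is blurred by reverberation; video says ~25 deg
mu, var = fuse_gaussians(20.0, 36.0, 25.0, 9.0)
print(mu, var)  # (24.0, 7.2): pulled toward the more certain cue, tighter variance
```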
Adaptive Pitch-Shifting With Applications to Intonation Adjustment in a Cappella Recordings
A central challenge for a cappella singers is to adjust their intonation and to stay in tune relative to their fellow singers. During editing of a cappella recordings, one may want to adjust the local intonation of individual singers or account for global intonation drifts over time. This requires applying a time-varying pitch-shift to the audio recording, which we refer to as adaptive pitch-shifting. In this context, existing (semi-)automatic approaches are either labor-intensive or face technical and musical limitations. In this work, we present automatic methods and tools for adaptive pitch-shifting with applications to intonation adjustment in a cappella recordings. To this end, we show how to incorporate time-varying information into existing pitch-shifting algorithms that are based on resampling and time-scale modification (TSM). Furthermore, we release an open-source Python toolbox, which includes a variety of TSM algorithms and an implementation of our method. Finally, we show the potential of our tools by two case studies on global and local intonation adjustment in a cappella recordings using a publicly available multitrack dataset of amateur choral singing.
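A minimal sketch of the resampling half of such a pipeline, assuming a time-varying shift curve in cents. Resampling changes duration along with pitch; as the abstract notes, a TSM stage would then restore the original timing, which is only indicated here.

```python
import numpy as np

def time_varying_resample(x, shift_cents):
    """shift_cents: desired pitch-shift per input sample, in cents."""
    rate = 2.0 ** (shift_cents / 1200.0)        # instantaneous resampling rate
    read_pos = np.cumsum(rate)                  # integrate the rate curve
    read_pos = read_pos[read_pos < len(x) - 1]
    return np.interp(read_pos, np.arange(len(x)), x)

fs = 22050
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 220 * t)                 # one second of A3
curve = np.linspace(0, 100, len(x))             # drift up to +100 cents
y = time_varying_resample(x, curve)             # pitch now follows the curve
# ...followed by time-scale modification to recover the original duration
```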
Automated rhythmic transformation of musical audio
Time-scale transformations of audio signals have traditionally relied exclusively upon manipulations of tempo. We present a novel technique for automatic mixing and synchronization between two musical signals. In this transformation, the original signal assumes the tempo, meter, and rhythmic structure of the model signal, while the extracted downbeats and salient intra-measure infrastructure of the original are maintained.
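A sketch of the beat-level time-warping that underlies such a transformation: given beat times of the original and of the model signal, build a piecewise-linear map so each original beat lands on the corresponding model beat. The beat times below are assumed to come from a beat tracker.

```python
import numpy as np

def warp_map(beats_orig, beats_model):
    """Return a function t_model -> t_orig for resampling the original signal."""
    return lambda t: np.interp(t, beats_model, beats_orig)

beats_orig = np.array([0.0, 0.52, 1.01, 1.55, 2.04])
beats_model = np.array([0.0, 0.50, 1.00, 1.50, 2.00])
f = warp_map(beats_orig, beats_model)
print(f(np.array([0.25, 0.75, 1.25])))  # original-time read positions
```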
Real-time Finite Difference Physical Models of Musical Instruments on a Field Programmable Gate Array (FPGA)
Real-time sound synthesis of musical instruments based on solving differential equations is of great interest in musical acoustics, especially in terms of linking geometric features of musical instruments to sound features. A major restriction of accurate physical models is the computational effort: the calculation cost is directly linked to the geometrical and material accuracy of a physical model, and thus to the validity of the results. This work presents a methodology for implementing real-time models of whole instrument geometries, modelled with the Finite Difference Method (FDM), on a Field Programmable Gate Array (FPGA), a device capable of massively parallel computation. Examples of three real-time musical instrument implementations are given: a banjo, a violin, and a Chinese ruan.
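A minimal 1-D finite-difference scheme of the kind such models build on: the ideal string, u_tt = c² u_xx, discretized on a grid. The real instrument geometries in the paper are much larger 2-D/3-D models; this only shows the explicit update rule that an FPGA can evaluate for all grid points in parallel. Parameter values are assumptions.

```python
import numpy as np

c, L, fs = 200.0, 1.0, 44100.0           # wave speed (m/s), length (m), sample rate
dt = 1.0 / fs
dx = c * dt                               # Courant condition, lambda = 1
N = int(L / dx)
u_prev, u, u_next = (np.zeros(N) for _ in range(3))
u[N // 3] = 1.0                           # pluck-like initial displacement

out = []
for n in range(2000):
    # explicit leapfrog update; with lambda = 1 it reduces to this stencil
    u_next[1:-1] = u[:-2] + u[2:] - u_prev[1:-1]
    u_prev, u, u_next = u, u_next, u_prev
    out.append(u[N // 2])                 # "pickup" at the string midpoint
```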
State of the Art in Sound Texture Synthesis
The synthesis of sound textures, such as rain, wind, or crowds, is an important application for cinema, multimedia creation, games, and installations. However, despite the clearly defined requirements of naturalness and flexibility, no automatic method has yet found widespread use. After clarifying the definition, terminology, and usages of sound texture synthesis, we give an overview of the many existing methods and approaches, and the few available software implementations, and classify them by the synthesis model they are based on, such as subtractive or additive synthesis, granular synthesis, corpus-based concatenative synthesis, wavelets, or physical modeling. Additionally, an overview is given of analysis methods used for sound texture synthesis, such as segmentation, statistical modeling, timbral analysis, and modeling of transitions.
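To make one of the model families named above concrete, here is a toy granular-synthesis sketch: short windowed grains are drawn from a source recording at random offsets and overlap-added at random times, which already yields a texture-like result for noisy sources. All parameter values are assumptions.

```python
import numpy as np

def granular_texture(src, out_len, grain=2048, density=200, fs=44100, seed=0):
    rng = np.random.default_rng(seed)
    win = np.hanning(grain)
    out = np.zeros(out_len)
    n_grains = int(density * out_len / fs)        # grains per second on average
    for _ in range(n_grains):
        i = rng.integers(0, len(src) - grain)     # where to read a grain
        j = rng.integers(0, out_len - grain)      # where to place it
        out[j:j + grain] += win * src[i:i + grain]
    return out / (np.max(np.abs(out)) + 1e-12)

src = np.random.default_rng(1).normal(size=44100)  # stand-in "rain-like" source
texture = granular_texture(src, out_len=2 * 44100)
```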
Time Scale Modification of Audio Using Non-Negative Matrix Factorization
This paper introduces an algorithm for time-scale modification of audio signals based on non-negative matrix factorization (NMF). The activation signals attributed to the detected components are used for identifying sound events. The segmentation of these events is used for detecting and preserving transients. In addition, the algorithm introduces the possibility of preserving the envelopes of overlapping sound events while globally modifying the duration of an audio clip.
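A sketch of the first stage described above: factorize a magnitude spectrogram with NMF and read event positions from the activation rows. The toy signal and threshold are assumptions; the paper's transient preservation and envelope handling are not reproduced here.

```python
import numpy as np
from sklearn.decomposition import NMF

fs, n_fft, hop = 8000, 512, 128
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t) * (np.sin(2 * np.pi * 3 * t) > 0)  # tone switching on/off

# magnitude STFT via strided frames (keeps the example dependency-free)
frames = np.lib.stride_tricks.sliding_window_view(x, n_fft)[::hop]
S = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1))        # (time, freq)

acts = NMF(n_components=2, init="nndsvd", max_iter=400).fit_transform(S)
H = acts.T                                 # activation signals, one row per component
k = int(np.argmax(H.sum(axis=1)))          # pick the strongest component
on = (H[k] > 0.5 * H[k].max()).astype(int)
onsets = np.flatnonzero(np.diff(on) == 1) + 1
print(onsets * hop / fs)                   # approximate event start times in seconds
```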