Assisted Sound Sample Generation with Musical Conditioning in Adversarial Auto-Encoders

Deep generative neural networks have thrived in the field of computer vision, enabling unprecedented intelligent image processing. Yet the results in audio remain less advanced and many applications are still to be investigated. Our project targets real-time sound synthesis from a reduced set of high-level parameters, including semantic controls that can be adapted to different sound libraries and specific tags. These generative variables should allow expressive modulation of target musical qualities and continuous mixing into new styles. To this end, we train auto-encoders on an orchestral database of individual note samples, along with their intrinsic attributes: note class, timbre domain (an instrument subset) and extended playing techniques. We condition the decoder for explicit control over the rendered note attributes and use latent adversarial training to learn expressive style parameters that can ultimately be mixed. We evaluate both the generative performance and the correlation of the attributes with the latent representation. Our ablation study demonstrates the effectiveness of the musical conditioning. The proposed model generates individual notes as magnitude spectrograms from any sample of the probabilistic latent code (each latent point maps to a single note), with expressive control of orchestral timbres and playing styles. Its training data subsets can be visualized directly in the 3-dimensional latent representation. Waveform rendering can be done offline with the Griffin-Lim algorithm. In order to allow real-time interaction, we fine-tune the decoder with a pretrained magnitude spectrogram inversion network and embed the full waveform generation pipeline in a plugin. Moreover, the encoder can be used to process new input samples; after manipulating their latent attribute representation, the decoder can generate sample variations, much like an audio effect. Our solution remains rather lightweight and fast to train, and it can be applied directly to other sound domains, including a user's libraries with custom sound tags that could be mapped to specific generative controls. As a result, it fosters creativity and intuitive audio style experimentation. Sound examples and additional visualizations are available on GitHub, as well as code after the review process.
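As a minimal sketch of the offline Griffin-Lim rendering step mentioned above (not the authors' implementation; the librosa call and the STFT settings are illustrative assumptions), phase reconstruction from a generated magnitude spectrogram could look like:

    # Illustrative sketch: invert a magnitude spectrogram to a waveform with
    # the Griffin-Lim algorithm. The STFT parameters are assumptions, not
    # values reported by the paper.
    import numpy as np
    import librosa

    n_fft = 2048
    hop_length = 512

    # `magnitude` stands in for one decoder output: a (1 + n_fft // 2, n_frames)
    # array of non-negative STFT magnitudes for a single generated note.
    # Here it is built from a synthetic test tone purely for demonstration.
    magnitude = np.abs(librosa.stft(librosa.tone(440.0, sr=22050, duration=1.0),
                                    n_fft=n_fft, hop_length=hop_length))

    # Iteratively estimate a phase consistent with the magnitudes and invert
    # back to a time-domain waveform.
    waveform = librosa.griffinlim(magnitude, n_iter=64,
                                  hop_length=hop_length, win_length=n_fft)

In practice this offline step is what the pretrained spectrogram inversion network replaces when real-time interaction is needed.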
First-Order Ambisonic Coding with PCA Matrixing and Quaternion-Based Interpolation

We present a spatial audio coding method that can extend existing speech/audio codecs, such as EVS or Opus, to represent first-order ambisonic (FOA) signals at low bit rates. The proposed method is based on principal component analysis (PCA) to decorrelate ambisonic components prior to multi-mono coding. The PCA rotation matrices are quantized in the generalized Euler angle domain; they are interpolated in the quaternion domain to avoid discontinuities between successive signal blocks. We also describe an adaptive bit allocation algorithm for an optimized multi-mono coding of principal components. A subjective evaluation using the MUSHRA methodology is presented to compare the performance of the proposed method with naive multi-mono coding using a fixed bit allocation. Results show significant quality improvements at bit rates in the range of 52.8 kbit/s (4 × 13.2) to 97.6 kbit/s (4 × 24.4) using the EVS codec.
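To illustrate the decorrelation step only (the block length, the toy signal and the component ordering are our assumptions; the Euler-angle quantization and quaternion-domain interpolation between blocks are not shown), a per-block PCA over the four FOA channels could be sketched as:

    # Illustrative sketch (not the codec implementation): per-block PCA
    # decorrelation of the four FOA channels (W, X, Y, Z) before multi-mono
    # coding with a core codec such as EVS.
    import numpy as np

    def pca_decorrelate(block: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
        """block: (4, n_samples) FOA block. Returns (principal components, rotation matrix)."""
        # Covariance across the four ambisonic channels.
        cov = block @ block.T / block.shape[1]
        # Eigendecomposition of the symmetric covariance; the eigenvectors form
        # an orthogonal matrix V such that V.T @ block has decorrelated rows.
        eigvals, eigvecs = np.linalg.eigh(cov)
        order = np.argsort(eigvals)[::-1]          # strongest component first
        V = eigvecs[:, order]
        # Flip one axis if needed so det(V) = +1, i.e. V is a proper rotation,
        # which is required for a consistent Euler-angle / quaternion encoding.
        if np.linalg.det(V) < 0:
            V[:, -1] *= -1
        return V.T @ block, V

    # Toy usage: decorrelate one 20 ms block of random FOA audio at 48 kHz.
    foa_block = np.random.randn(4, 960)
    components, rotation = pca_decorrelate(foa_block)

The rotation matrix returned per block is what would then be quantized and interpolated across block boundaries to avoid the discontinuities described above.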