SOUNDS Workshop 2024
You are cordially invited to participate in the SOUNDS workshop, a half-day satellite workshop to the International Workshop on Acoustic Signal Enhancement (IWAENC) 2024, on Monday, September 9th.
Service-Oriented, Ubiquitous, Network-Driven Sound (SOUNDS) is a Marie Skłodowska-Curie Actions European Training Network (MSCA-ETN) that aims to reshape how we interact with sound that can be captured, processed, and reproduced by the mobile and wearable devices that are now ubiquitous in our everyday environments. The SOUNDS team consists of academic and non-academic participants: KU Leuven, Aalborg University, Imperial College London, Carl von Ossietzky University of Oldenburg, Bang & Olufsen, Nuance Communications Inc., Oticon A/S, Cerence GmbH, Televic Conference, CEDAR Audio Ltd., and the Fraunhofer Institute for Digital Media Technology.
The primary aim of the workshop is to present and demonstrate the recent research results acquired within the SOUNDS project.
Program
14:00 – 14:05 Welcome
14:05 – 14:30 Overview of results and contributions of the SOUNDS project (Prof. Toon van Waterschoot, KU Leuven)
14:30 – 15:30 Keynote talk: “Spatial Audio for Virtual and Augmented Reality Devices” by Ivan J. Tashev, Partner Software Architect at Microsoft Research.
15:30 – 15:45 Coffee, cake, and networking break
15:45 – 17:45 SOUNDS poster and demo presentations
Signing-up
Participation in this workshop is free of charge if you are registered for IWAENC 2024. When registering for IWAENC, you can also sign up, free of charge, for the satellite events. Signing up is required to help us plan the catering and room size.
You are most welcome to distribute this invitation to other relevant people in your organization.
Date & Venue
September 9, 2024
Musikkens Hus
Musikkens Pl. 1, 9000 Aalborg
Contact Info
Prof. Patrick Naylor (p.naylor@imperial.ac.uk)
Prof. Simon Doclo (simon.doclo@uni-oldenburg.de)
Prof. Jan Østergaard (jo@es.aau.dk)
Titles and abstracts of the poster presentations
User-driven microphone selection method for hearing assistive systems
Vasudha Sathyapriyan(1,2), Michael Syskind Pedersen(1), Jan Østergaard(2), Mike Brookes(3), Patrick Naylor(3), Jesper Jensen(1,2)
(1) Demant A/S, Smørum, Denmark,
(2) Department of Electronic Systems, Aalborg University, Aalborg, Denmark,
(3) Department of Electrical and Electronic Engineering, Imperial College London, London, UK
Abstract
Remote microphones (RMs) are now being employed by hearing aids (HAs) to enhance the desired signal. The benefit offered by RMs stems primarily from their potential to be deployed closer to the desired source, which enables them to capture a higher-quality signal from the desired source than the HA microphones can. This improves the speech enhancement ability of the HAs, especially in very noisy environments. However, the quality of the desired signal captured by the RMs can vary with their location and distance from the desired source. Consequently, the RMs accessible to the HAs generally do not all contribute equally to the desired speech enhancement task. Furthermore, using all the RMs deployed in the room would increase the consumption of resources such as computational power, transmission power, and bandwidth, which in turn would reduce the battery lifetime of the HAs. Therefore, it is important to determine, for any particular acoustic situation, a subset of RMs that benefit the HAs.
Methods proposed in the literature typically use utility functions that compare the cost of including or excluding an RM against the cost function of an HA processing task. However, these methods are typically not designed for acoustic situations with competing noise sources, in which additional prior knowledge is needed to qualify a source as the desired source. In contrast, in this work we propose a user-guided RM selection strategy. Specifically, by performing microphone selection that is guided by the user's head movements and by using the HA microphones in collaboration with the RMs, we demonstrate that the proposed method effectively selects the set of RMs that best capture the desired signal, even in the presence of competing speakers.
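As a rough illustration of what utility-based RM selection can look like in code, the following hedged Python sketch scores each RM with a hypothetical utility that combines an estimated desired-source SNR with the alignment between the RM's direction and the user's look direction, and keeps the RMs above a threshold. All names, weights, and the scoring rule are illustrative assumptions, not the authors' proposed algorithm.

```python
# Hypothetical sketch of remote-microphone (RM) selection; not the proposed method.
import numpy as np

def select_rms(snr_db, rm_directions, look_direction, utility_floor_db=0.0):
    """snr_db: (N,) per-RM SNR estimates; rm_directions: (N, 3) and
    look_direction: (3,) unit vectors; returns indices of selected RMs."""
    alignment = rm_directions @ look_direction          # cosine between RM and gaze
    utility = snr_db + 10.0 * np.clip(alignment, 0.0, 1.0)  # assumed weighting
    return np.flatnonzero(utility >= utility_floor_db)
```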
Enhancing Robustness of Distributed Audio Signal Estimation
Paul Didier(1), Toon van Waterschoot(1), Simon Doclo(2), Jörg Bitzer(3), Marc Moonen(1)
(1)STADIUS Center for Dynamical Systems, Department of Electrical Engineering (ESAT), KU Leuven, 3001 Leuven, Belgium.
(2)Signal Processing Group, Department of Medical Physics and Acoustics and Cluster of Excellence Hearing4all, University of Oldenburg, 26111 Oldenburg, Germany.
(3)Fraunhofer IDMT, Project Group Hearing, Speech and Audio Technology, Oldenburg, Germany.
Abstract
This poster presents an overview of Paul Didier’s ongoing PhD research which focuses on strategies to improve the reliability of distributed audio signal processing algorithms. Distributed systems, by their nature, encounter various challenges including network congestion, latency, and node failures. Addressing these robustness issues is crucial for ensuring seamless operation in real-world applications. Through the use of fused data exchange mechanisms, sampling rate offset (SRO) mitigation, and topology-independent algorithmic design, this research explores avenues to enhance the reliability and resilience of distributed algorithms.
A central object of investigation is the distributed adaptive node-specific signal estimation (DANSE) algorithm, which allows nodes in a wireless acoustic sensor network (WASN) to reach the centralized target signal estimation performance in a distributed manner while only exchanging low-dimensional versions of their multichannel sensor signals. Improvements are proposed to account for asynchronicities between nodes in a fully connected WASN, estimating and compensating for SROs in an adaptive fashion. The proposed solution retains an efficient filter-bank frequency-domain implementation by allowing broadcasts of an arbitrary number of samples at a time. Another contribution modifies the topology-independent (TI) version of the DANSE algorithm (TI-DANSE) to ensure numerically stable operation over long periods of time. Recent and ongoing efforts are also presented, reaching beyond algorithm refinement to address some of the basic signal model assumptions vital for DANSE convergence. Finally, a comprehensive view of DANSE is proposed, merging the versions applicable to fully connected and ad-hoc topologies into a single, more general definition with its own advantages.
Integrative Approaches to Enhancing Privacy in Speech Technology
Francesco Nespoli(1,3), Jule Pohlhausen(4), Michele Panariello(5), Daniel Barreda(1), Joerg Bitzer(2,4) and Patrick A. Naylor(3)
(1)Microsoft, London, UK
(2) Fraunhofer Institute for Digital Media Technology IDMT, Oldenburg, Germany
(3)Imperial College, London, UK
(4)Institute of Hearing Technology and Audiology, Jade University of Applied Science, Oldenburg, Germany
(5)EURECOM, France
Abstract
The advancement of speech technology has profoundly transformed how humans interact with machines, making everyday tasks more accessible through voice commands and automated responses. However, this convenience brings significant privacy risks, as speech data inherently contains sensitive personal identifiable information (PII), ranging from voiceprints to conversational content. Addressing these risks, recent works focus on developing robust anonymization techniques to protect user privacy without compromising the utility of speech technologies.
Our research introduces sophisticated anonymization models that leverage neural audio codecs (NACs), voice conversion technologies, and a two-stage anonymization process that effectively disentangles speaker identity from speech signals. These models mitigate the risk of PII leakage by ensuring that anonymized outputs retain less speaker-specific information, thus enhancing privacy. Additionally, innovations in privacy-preserving feature extraction methods like spectral smoothing and the application of the McAdams coefficient are explored. These methods are particularly effective in scenarios with low computational resources, providing viable solutions for real-time applications on portable devices.
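As a concrete illustration of the low-resource direction mentioned above, the following hedged Python sketch implements the classic McAdams-coefficient idea: warp the angles of the LPC poles of each speech frame and re-synthesize from the prediction residual. Frame handling, the LPC order, the value of alpha, and the use of librosa for LPC analysis are assumptions made for this sketch, not details of the authors' system.

```python
# Hedged sketch of McAdams-coefficient anonymization (LPC pole-angle warping).
import numpy as np
from scipy.signal import lfilter
import librosa  # assumed available for LPC analysis

def mcadams_anonymize_frame(frame, order=20, alpha=0.8):
    """Warp the LPC pole angles of one speech frame by the McAdams coefficient alpha."""
    a = librosa.lpc(frame, order=order)          # LPC coefficients [1, a1, ..., ap]
    residual = lfilter(a, [1.0], frame)          # prediction residual (excitation)
    poles = np.roots(a)
    warped = np.array([
        p if np.isreal(p)
        else np.abs(p) * np.exp(1j * np.sign(np.angle(p)) * np.abs(np.angle(p)) ** alpha)
        for p in poles
    ])
    a_warped = np.real(np.poly(warped))          # warped all-pole model (conjugacy preserved)
    return lfilter([1.0], a_warped, residual)    # re-synthesize the anonymized frame
```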
By integrating these advanced techniques, the research aims to set new benchmarks in privacy protection in speech technology. This integration not only addresses the technical challenges of anonymization but also aligns with stringent global data protection regulations, making it a timely contribution to the field. The collective findings underscore the importance of developing speech technologies that prioritize user privacy while maintaining functional integrity and utility.
Microphone Pair Selection for Sound Source Localization in Massive Arrays of Spatially Distributed Microphones
Bilgesu Çakmak∗, Thomas Dietzen∗, Randall Ali†, Patrick Naylor‡, Toon van Waterschoot∗
∗ Dept. of Electrical Engineering (ESAT), STADIUS, KU Leuven, Leuven, Belgium
† Institute of Sound Recording, Dept. of Music & Media, University of Surrey, Guildford, UK
‡Dept. of Electrical and Electronic Engineering, Imperial College London, London, UK
Abstract
In massive distributed microphone arrays, such as in conference systems, using all sensors leads to unnecessary energy consumption and limited network lifetime, as some sensors may contribute little to specific estimation tasks such as sound source localization (SSL). In this work, we propose two pre-processing methods for microphone pair selection in steered response power (SRP) based SSL using massive arrays of spatially distributed microphones. The first method thresholds the cross-correlations, which results in a selection of broadside microphone pairs, that is, microphone pairs oriented broadside to the source direction; only the signals from those microphones are collected for use in the SRP algorithm. The second method is based on the sparseness of the cross-correlation, in which pairs with a high sparsity factor are selected for use in localization. These methods are further improved by leveraging orientational diversity. Simulations show that, using only the selected microphone pairs, we achieve localization performance comparable to that obtained when all microphones are employed.
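To make the pair-selection idea concrete, the following hedged Python sketch computes GCC-PHAT cross-correlations per microphone pair and keeps only the pairs whose correlation peak exceeds a fraction of the largest peak; the kept pairs would then feed the SRP accumulation. The thresholding rule and parameter values are simplifications for illustration, not the exact criteria proposed in the poster.

```python
# Hedged sketch: microphone pair selection by thresholding GCC-PHAT peaks before SRP.
import numpy as np

def gcc_phat(x1, x2, nfft=1024):
    """GCC-PHAT cross-correlation between two microphone signals."""
    X1, X2 = np.fft.rfft(x1, nfft), np.fft.rfft(x2, nfft)
    cross = X1 * np.conj(X2)
    cross /= np.abs(cross) + 1e-12               # PHAT weighting
    return np.fft.fftshift(np.fft.irfft(cross, nfft))

def select_pairs(signals, pairs, threshold=0.5):
    """Keep pairs whose GCC-PHAT peak exceeds a fraction of the largest peak."""
    peaks = np.array([np.max(gcc_phat(signals[i], signals[j])) for i, j in pairs])
    keep = peaks >= threshold * peaks.max()
    return [pair for pair, k in zip(pairs, keep) if k]
```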
Drone Localization and Tracking System Using Acoustic Arrays and Machine Learning Algorithms
Ximei Yang (Fraunhofer IDMT), Christian Rollwage (Fraunhofer IDMT), Simon Doclo (University of Oldenburg), Patrick A. Naylor (Imperial College London), Jörg Bitzer (Fraunhofer IDMT)
Abstract
With the increasing affordability and sophistication of unmanned aerial vehicles (UAVs), concerns have arisen regarding security, privacy, and noise. Traditional audio-based drone localization methods, relying on cross-correlation or subspace approaches, face limitations in handling complex scenarios. Deep learning has emerged as a promising solution, yet its reliable training requires substantial datasets, often laborious and resource-intensive to collect.
This study addresses the challenge by synthesizing drone signals and constructing a dataset for deep neural network (DNN)-based localization. The signal from the rotating propellers of the drone is generated based on the physics of sound generation, with the estimation of real-time revolutions per minute (RPM) of the drone rotors being crucial. The synthetic drone signals closely resemble real data in 3D space under diverse conditions, accounting for drone type, manoeuvre type, flight path, wind speed, and direction. A comprehensive drone positioning framework is also established, incorporating a 3D outdoor drone flight model accounting for signal delays, reflections, microphone array white noise, and environmental noise.
These simulated datasets serve as training data for sound source localization (SSL) neural networks, evaluating their performance in UAV localization against classical signal processing algorithms, thereby advancing the field of drone positioning in challenging outdoor acoustic environments.
Neural-SRP: Neural Steered Response Power methods for sound source localization
Eric Grinstein1, Christopher M. Hicks2, Toon van Waterschoot3, Mike Brookes1, Patrick A. Naylor1
1Electrical and Electronic Engineering Department, Imperial College London, U.K.
2CEDAR Audio Ltd., U.K.
3Department of Electrical Engineering (ESAT), KU Leuven, Belgium
Abstract
Neural networks have achieved state-of-the-art performance on the task of acoustic Sound Source Localization (SSL) using microphone arrays. At the same time, classical signal processing-based techniques such as the Steered Response Power (SRP) method remain a relevant research topic due to their explainability and generalization to unseen environments. A third alternative consists of forming a hybrid model that combines neural and signal processing-based components. This work focuses on Neural-SRP methods, which add functional blocks to, or replace functional blocks of, the SRP method. We present and compare their different variations, evaluating the advantages and disadvantages of each.
Dereverberation in Acoustic Sensor Networks using the Weighted Prediction Error Algorithm
Anselm Lohmann1, Toon van Waterschoot2, Joerg Bitzer3, Simon Doclo1,3
1Carl von Ossietzky Universität Oldenburg, Dept. of Medical Physics and Acoustics, Germany
2KU Leuven, Department of Electrical Engineering (ESAT-STADIUS), Leuven, Belgium
3Fraunhofer IDMT, Project Group Hearing, Speech and Audio Technology, Oldenburg, Germany
Abstract
Reverberation can severely degrade the quality of speech signals recorded by microphones in a room. In the last decades, several multi-microphone speech dereverberation algorithms have been proposed, among which is the weighted prediction error (WPE) algorithm. WPE performs dereverberation by estimating the late reverberant component in a reference microphone from all reverberant microphone signals and subtracting this estimate from the reference microphone signal. The so-called prediction delay plays an important role in WPE: it is introduced to reduce the correlation between the prediction signals and the direct component in the reference microphone signal, thereby aiming to preserve the direct component and early reflections.
The WPE algorithm and its variants have typically been developed and validated for compact microphone arrays with closely spaced microphones, using a microphone-independent prediction delay and an arbitrary reference microphone. In this contribution, we consider an acoustic network with spatially distributed microphones and revisit several design choices regarding the prediction delay and the reference microphone for the WPE algorithm.
First, in an acoustic sensor network, large and diverse time-differences-of-arrival (TDOAs) of the speech source between the reference microphone and the other microphones may occur. Hence, when using a microphone-independent prediction delay, the reference and prediction signals may still be significantly correlated, leading to distortion in the dereverberated output signal. In order to decorrelate the signals, we propose to apply TDOA compensation with respect to the reference microphone, resulting in microphone-dependent prediction delays for the WPE algorithm. Experimental results using estimated TDOAs clearly show the benefit of using microphone-dependent prediction delays.
Secondly, whereas the choice of reference microphone typically does not influence the dereverberation performance for compact arrays, in an acoustic sensor network the choice of reference microphone may significantly affect the performance of WPE. In this contribution, we propose to perform reference microphone selection for WPE based on the normalized ℓp-norm of the dereverberated output. Experimental results show that the proposed method yields better dereverberation performance than a selection based on the early-to-late reverberation ratio or the signal power.
Furthermore, the spatial diversity of the available microphone signals in an acoustic sensor network may allow for similar dereverberation performance using only a subset of all available microphones. Using the popular convex relaxation method, we propose to perform microphone subset selection by introducing a group sparsity penalty on the prediction filter coefficients. Experimental results show that a similar dereverberation performance can be achieved using only a subset of all available microphones.
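For readers unfamiliar with WPE, the following hedged Python sketch shows a standard per-frequency-bin iteration of the baseline algorithm: power-weighted multichannel linear prediction with a prediction delay, followed by subtraction of the predicted late reverberation from the reference signal. It illustrates the conventional WPE formulation only; the microphone-dependent delays, reference selection, and subset selection proposed above are not included, and parameter values are placeholders.

```python
# Minimal per-frequency-bin WPE iteration (conventional algorithm, for illustration).
import numpy as np

def wpe_bin(Y, delay=3, taps=10, iterations=3, ref=0, eps=1e-8):
    """Y: (M microphones, T frames) complex STFT coefficients of one frequency bin."""
    M, T = Y.shape
    d = Y[ref].copy()                              # current dereverberated reference estimate
    for _ in range(iterations):
        lam = np.maximum(np.abs(d) ** 2, eps)      # time-varying power weights
        # Stack delayed multichannel observations into vectors of length M * taps.
        Yt = np.zeros((M * taps, T), dtype=complex)
        for k in range(taps):
            shift = delay + k
            Yt[k * M:(k + 1) * M, shift:] = Y[:, :T - shift]
        R = (Yt / lam) @ Yt.conj().T               # weighted covariance of stacked signals
        r = (Yt / lam) @ Y[ref].conj()             # weighted correlation with the reference
        g = np.linalg.solve(R + eps * np.eye(M * taps), r)
        d = Y[ref] - g.conj() @ Yt                 # subtract predicted late reverberation
    return d
```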
Deep digital Joint Source-Channel based wireless speech transmission
Mohammad Bokaei1, Jesper Jensen1,2, Simon Doclo3, Jan Østergaard1
1Aalborg University, Aalborg, Denmark
2Oticon A/S, Copenhagen, Denmark
3Carl von Ossietzky Universität, Oldenburg, Germany
Abstract
In this paper, we study low-latency speech transmission over a wireless channel based on deep digital Joint Source-Channel Coding (JSCC). Inspired by recent advances in quantization techniques for JSCC problems, we explore the feasibility of employing constellation-constrained deep JSCC for speech transmission. Our proposed system leverages a single DNN to jointly handle source coding, channel coding, and direct mapping of the output to specific constellation points. We demonstrate the ability of the joint system to operate effectively under various latency constraints while outperforming separate systems, especially in adverse channel conditions. Simulation results validate the efficacy of our approach, highlighting its potential for real-world applications requiring low-latency speech transmission over wireless channels.
Binaural Speech Enhancement Using Complex Convolutional Networks
Vikas Tokala, Dept. of Electrical and Electronic Engineering, Imperial College London, UK.
Mike Brookes, Dept. of Electrical and Electronic Engineering, Imperial College London, UK.
Simon Doclo, Dept. of Medical Physics and Acoustics, Carl von Ossietzky Universität Oldenburg, Germany.
Jesper Jensen, Demant A/S, Smørum, Denmark and Dept. of Electronic Systems, Aalborg University, Denmark.
Patrick A. Naylor, Dept. of Electrical and Electronic Engineering, Imperial College London, UK.
Abstract
Binaural speech enhancement has been established in recent years as the state-of-the-art approach for enhancement in hearing aids and augmented/virtual reality devices. Studies have shown that in noisy acoustic environments, providing binaural signals to the user of an assistive listening device may improve speech intelligibility and spatial awareness. Modern assistive listening devices such as binaural hearing aids and virtual reality headsets are equipped with multiple microphones, allowing speech enhancement methods to utilise this multichannel input for binaural speech enhancement. In this work, different models using complex-valued convolutional networks for binaural speech enhancement were studied. To enhance two-channel noisy binaural signals, this study introduces two complex-valued convolutional networks within the convolutional encoder-decoder (CED) framework. The first model integrates a complex-valued multi-head attention-based transformer block between the encoder and decoder components of the CED network. In contrast, the second model substitutes the transformer module with a complex-valued bidirectional LSTM block. The networks are trained to estimate separate complex ratio masks (CRMs) for the left and right ear channels, using a novel loss function that incorporates the preservation of spatial information along with speech intelligibility improvement and noise reduction. Simulation results for acoustic scenarios with a single target speaker and isotropic noise of various types show that the proposed networks improve the estimated binaural speech intelligibility and preserve the binaural cues better than baseline algorithms.
For the multichannel input, two neural multichannel binaural speech enhancement methodologies were investigated, designed to maintain the spatial characteristics of speech while enhancing intelligibility and attenuating additive noise. These approaches integrate a complex convolutional neural network guided by a beamformer, as well as an end-to-end convolutional neural network based on a convolutional encoder-decoder architecture. These multichannel networks estimate complex ratio masks in the time-frequency domain for the left and right channels, similarly to the two-channel networks described earlier. For the beamformer stage, binaural versions of the Oracle MVDR (OMVDR) and Fixed Cylindrical Isotropic MVDR (FCIM) beamformers are considered. In terms of objective metrics, simulation results show that the proposed methods significantly improve binaural speech intelligibility and reduce noise while preserving the binaural spatial cues of the speech signal in acoustic situations with a single target speaker and isotropic noise of various types. A listening demo with audio examples of all the methods will be presented.
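For context, a complex ratio mask is simply applied by complex multiplication per time-frequency bin, one mask per ear. The following hedged Python sketch shows that final step only; the mask-estimating networks described above are omitted and the array shapes are assumptions.

```python
# Hedged sketch: applying estimated complex ratio masks (CRMs) to noisy binaural STFTs.
import numpy as np

def apply_crm(Y_left, Y_right, M_left, M_right):
    """Y_*: (freq, frames) noisy STFTs; M_*: complex masks of the same shape."""
    return M_left * Y_left, M_right * Y_right     # per-bin complex multiplication
```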
Probability distribution of spherical harmonic coefficients in rooms
Jesper Brunnström(1), Martin Bo Møller(2), Jan Østergaard(3), Marc Moonen(1)
(1) STADIUS Center for Dynamical Systems, Dept. of Electrical Engineering (ESAT), KU Leuven, Leuven, Belgium
(2) Bang & Olufsen, Acoustics R&D, Struer, Denmark
(3) Dept. of Electronic Systems, Aalborg University, Aalborg, Denmark
Abstract
Sound field estimates are important for many audio applications, such as sound field reproduction. Obtaining such estimates with good precision can be time-consuming, and therefore considerable effort is put into researching more effective estimators.
A Bayesian approach to sound field estimation has previously been proposed, where the coefficients of a spherical harmonic decomposition are estimated from the data recorded by a set of microphones. Under this Bayesian framework, incorporating a more accurate prior distribution leads to better estimates for a fixed amount of data, and can therefore lead to fewer required measurements, making the measurement process easier.
In previous methods, a Gaussian prior distribution has been assumed in order to maintain a closed-form posterior distribution, which has clear computational advantages. However, given improvements to numerical Bayesian inference methods in recent years, such as variational inference and Markov chain Monte Carlo sampling, the Gaussian assumption is not necessarily crucial.
The goal of this work is to investigate the probability distribution of spherical harmonic coefficients in real rooms, which can aid in model selection for Bayesian inference. The spherical harmonic coefficients are estimated in a number of different rooms and for different parameters, providing a dataset from which the distribution of the spherical harmonic coefficients can be observed. From the estimated coefficients, crucial parameters for a probabilistic model are identified, as well as reasonable choices of prior distributions for these parameters.
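As a sketch of the closed-form case referred to above, assume a linear Gaussian observation model p = H w + noise, where w are the spherical harmonic coefficients, H is the spherical harmonic basis evaluated at the microphone positions, and the noise variance is known; the posterior over w is then also Gaussian. The snippet below computes its mean and covariance; all variable names are illustrative, and the fixed noise variance is an assumption.

```python
# Hedged sketch: Gaussian posterior for spherical-harmonic coefficients w in p = H w + n.
import numpy as np

def sh_posterior(p, H, prior_cov, noise_var):
    """p: (M,) mic pressures; H: (M, N) SH basis; returns posterior mean and covariance."""
    prior_prec = np.linalg.inv(prior_cov)
    post_cov = np.linalg.inv(H.conj().T @ H / noise_var + prior_prec)
    post_mean = post_cov @ (H.conj().T @ p) / noise_var
    return post_mean, post_cov
```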
Acoustic speech localization in the near field using distributed microphones
Kaspar Müller1, Markus Buck1, Simon Doclo2, Jan Østergaard3, Tobias Wolff1
1Cerence GmbH, Acoustic Speech Enhancement, Ulm, Germany
2University of Oldenburg, Department of Medical Physics and Acoustics and
Cluster of Excellence Hearing4all, Oldenburg, Germany
3Aalborg University, Department of Electronic Systems, Aalborg, Denmark
Abstract
Heterogeneous environments such as smart homes or modern cars have multiple microphones that pick up speech signals from different positions. This allows for a large variety of speech-processing applications, such as hands-free systems for voice calls or voice control, that require speech enhancement, e.g., to reduce noise or interfering speech. One common challenge in such applications is robust and computationally efficient speaker detection and localization using distributed acoustic sensors, rather than the compact microphone arrays that are most commonly used. We propose a speech localization method that exploits both level and time differences in distributed microphone signals. Specifically, we develop a generalized version of the conventional steered response power method that can use arbitrary acoustic models, such as a near-field model, and furthermore incorporates the directivity of sources and receivers to improve localization performance compared to the conventional steered response power method.
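To illustrate the near-field idea, the following hedged Python sketch evaluates a steered response power over candidate source positions using spherical-wave delays and 1/r attenuations, i.e., a simple near-field acoustic model. It does not include the generalized weighting or the source and receiver directivity terms described above; names and the model details are assumptions for illustration only.

```python
# Hedged sketch: near-field steered response power over candidate source positions.
import numpy as np

def near_field_srp(X, freqs, mic_pos, candidates, c=343.0):
    """X: (M, F) multichannel spectra; freqs: (F,) Hz; mic_pos: (M, 3); candidates: (Q, 3)."""
    power = np.zeros(len(candidates))
    for q, src in enumerate(candidates):
        r = np.linalg.norm(mic_pos - src, axis=1)                # source-to-mic distances
        steer = np.exp(-2j * np.pi * np.outer(r / c, freqs)) / r[:, None]  # spherical wave
        y = np.sum(np.conj(steer) * X, axis=0)                   # steer and sum per frequency
        power[q] = np.sum(np.abs(y) ** 2)
    return power
```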
Adaptive Sound Zones with Online Room Impulse Response Estimation
José Cadavid1, Martin Bo Møller2, Søren Bech1,2, Toon van Waterschoot3, Jan Østergaard1.
1 Department of Electronic Systems, Aalborg University, Aalborg, 9000, Denmark
2 Bang & Olufsen A/S, Struer, 7600, Denmark
3 Department of Electrical Engineering (ESAT), KU Leuven, Leuven, 3001, Belgium
Abstract
Among several sound field control methods, Sound Zones aim to render different audio contents in separate regions of a room. This requires creating zones where a given content is intended to be heard, and others where it is to be inaudible. Known respectively as bright and dark, such zones must be carefully sampled in space to obtain precise information about the system at those positions, information usually expressed as Room Impulse Responses (RIRs). These are necessary to design a set of filters which, applied to a set of loudspeakers, allow rendering the intended sound zones.
Currently, changes in the system require new RIR acquisitions and control filter calculations, limiting the versatility of the sound zones. However, recent methods in Active Noise Control enable simultaneous online estimation of the RIRs and updating of the control filters. In practice, such systems would allow the sound zone rendering to adapt to changes in the position of the listener, in the acoustic conditions of the room, or, in general, to any change that can be described by the RIRs.
Since they comprise two different adaptive processes, such implementations are both complex and computationally expensive. To delve further into this problem, this work compares some of the existing methodologies and, by exploring the influence of the main parameters on the resulting performance, provides insights into possible strategies for simpler and more efficient implementations. In this regard, and in contrast to statements in the original formulations, preliminary results demonstrate the feasibility of using block-based algorithms, laying the groundwork for real-time applications through more efficient, faster processing.
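For readers new to sound zone filter design, a common (non-adaptive) starting point is a regularized pressure-matching problem per frequency: match a target pressure in the bright zone while penalizing energy in the dark zone. The hedged Python sketch below shows that generic formulation; it is background only, not the adaptive, online-RIR-estimation approach discussed above, and the weighting and regularization values are placeholders.

```python
# Hedged sketch: frequency-domain pressure-matching sound-zone filter design (one frequency).
import numpy as np

def sound_zone_filters(G_bright, G_dark, target, mu=1.0, reg=1e-6):
    """G_*: (mics_in_zone, loudspeakers) transfer functions; target: (bright mics,) pressures.
    Returns complex loudspeaker weights minimizing bright-zone error plus dark-zone energy."""
    L = G_bright.shape[1]
    A = (G_bright.conj().T @ G_bright
         + mu * G_dark.conj().T @ G_dark
         + reg * np.eye(L))
    b = G_bright.conj().T @ target
    return np.linalg.solve(A, b)
```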
Distributed Blind System Identification In Wireless Acoustic Sensor Networks
Matthias Blochberger (1), Jesper Jensen (2,3), Jan Østergaard (3), Randall Ali (4), Filip Elvander (5), Marc Moonen (1), Toon van Waterschoot (1)
(1) Department of Electrical Engineering (ESAT-STADIUS), KU Leuven, Leuven, Belgium
(2) Demant A/S, Smørum, Denmark
(3) Department of Electronic Systems, Aalborg University, Aalborg, Denmark
(4) Department of Music and Media, University of Surrey, Guildford, Surrey, United Kingdom
(5) Department of Information and Communications Engineering, Aalto University, Espoo, Finland
Abstract
We present an overview of the most important results of our research into the task of blind system identification (BSI) in wireless acoustic sensor networks (WASNs). Here, BSI refers to estimating the room impulse responses (RIRs), or their frequency-domain representation, of a single-input-multiple-output (SIMO) system, i.e., a physical room with a single sound source and multiple acoustic sensors, without knowledge of the source signal. The sensor network comprises sensor nodes, each with a microphone, a processing unit, and transmission capabilities, that share information over a network graph.
Building upon the well-established cross-relation (CR) method, we use an adaptive distributed estimation algorithm employing the online alternating direction method of multipliers (O-ADMM) to solve a general-form consensus optimization problem in a distributed manner. We separate the high-dimensional multichannel problem into lower-dimensional subproblems, which are then solved by the processing units of the sensor nodes, all mathematically connected through consensus constraints. The subproblems can be seen as eigenvalue problems and are solved adaptively, i.e., by tracking the smallest eigenpair of the problem. Further, we introduce an adaptive distributed averaging scheme to estimate the global norm of the system, which enforces a non-triviality constraint and maintains the scaling between subproblems and therefore between channel estimates. We employ an adaptive gain-shape coding approach using Huffman encoding to reduce the communication cost, i.e., the amount of data transmitted between sensor nodes. To improve the adaptive algorithm's robustness in noisy conditions, especially when the source signal is speech, i.e., non-white and non-stationary, we estimate noise characteristics in signal periods where only noise is present and reformulate the original problem as a generalized eigenvalue problem, which we then solve adaptively in a manner similar to the original problem.
To show the algorithm's performance, we present numerical simulation results for various acoustic scenarios, such as different room sizes and signal-to-noise ratios (SNRs).
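The cross-relation identity underlying this work states that, for a common source, convolving channel i's signal with channel j's RIR equals convolving channel j's signal with channel i's RIR. The hedged Python sketch below evaluates that identity as a batch least-squares error for a two-channel case; it is background for the CR method only, not the distributed adaptive O-ADMM solution described above.

```python
# Hedged sketch: cross-relation (CR) error x1*h2 ≈ x2*h1 for candidate RIR estimates.
import numpy as np

def cross_relation_error(x1, x2, h1, h2):
    """x1, x2: equal-length microphone signals; h1, h2: equal-length candidate RIRs.
    Smaller error means the candidate RIRs better satisfy the cross-relation."""
    return np.sum((np.convolve(x1, h2) - np.convolve(x2, h1)) ** 2)
```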
Understanding the direct and reverberant sound fields, an investigation with applications
James Brooks-Park (1), Jan Østergaard (2), Søren Bech (2&3), Steven van de Par (1)
(1) Carl von Ossietzky Universität Oldenburg Acoustics Group, Oldenburg, Germany
(2) Aalborg University, Dept. of Electronic Systems, Aalborg, Denmark
(3) Bang & Olufsen a/s, Struer, Denmark
Abstract
The differences between the direct and reverberant sound fields are well known in perceptual acoustics. Physically, they are differentiable by time of arrival, duration, and their phase and spectral characteristics. Perceptually, the effect of the direct sound and early reflections on source localisation and apparent source width has been investigated extensively. In this poster, thresholds for detecting spectral changes of the direct and reverberant sound fields are determined by means of psychoacoustical experiments, furthering our knowledge of how humans perceive sound in reverberant environments.
Knowing how humans perceive the direct and reverberant sound fields differently, a room compensation method is proposed in which only the perceived reverberant sound field is altered. The anechoic response of a loudspeaker can be assumed to be well optimised by the manufacturer, giving a loudspeaker its own acoustic signature. However, due to the directivity characteristics of a loudspeaker and imperfect room reflections, the reverberant sound field is often uncontrolled and can degrade loudspeaker playback quality in reverberant environments. The technique proposed in this poster avoids many of the problems traditional room compensation methods face, as well as providing a new degree of control over the sound field, where the direct and reverberant sound fields can be optimised individually. Perceptual evaluation results are presented, investigating whether and how the proposed method outperforms traditional techniques.
Comparison of two methods for audio quality evaluation of reproduction systems using spatially dynamic content
Pia Nancy Porysek Moreta (1,2)
Søren Bech (1,2)
Jon Francombe (1)
Jan Østergaard (1)
Steven van de Par (3)
(1) Bang & Olufsen
(2) Aalborg University
(3) Carl von Ossietzky University of Oldenburg
Abstract
Advances in spatial audio technology enable an improved and more complex rendering of sound source movement (dynamic stimuli). The literature indicates that standardized methodologies for sound quality evaluation may encounter challenges when assessing reproduction systems using spatially dynamic stimuli.
One possible solution is the use of temporal methods to record a continuous observation of quality. In the presented study, a listening experiment was designed to explore differences when measuring attributes of basic audio quality with two different methodologies: 1) single-stimulus single-value quality evaluation and 2) single-stimulus continuous quality evaluation. The reproduction of two program items over two loudspeaker configurations (stereo and surround with height) was rated for two attributes of audio quality (basic audio quality and surrounding) across two different spatial variations (dynamic and stationary).
The findings presented in this poster will offer insights and recommendations for future research concerning the evaluation of reproduction systems. Specifically, their performance within the context of handling complex, spatially dynamic content will be explored, thus advancing the ability to test technological advancements in the area of spatial audio.