Issue
Acta Acust.
Volume 9, 2025
Topical Issue - Virtual acoustics
Article Number 31
Number of page(s) 15
DOI https://doi.org/10.1051/aacus/2025012
Published online 24 April 2025

© The Author(s), Published by EDP Sciences, 2025

Licence: Creative Commons. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1 Introduction

In the design process of new auditoriums, optimising the acoustics of a room can be critical to meet the needs of its intended use. This optimisation involves considering the room's dimensions, as well as defining absorbing and reflecting surfaces. Typical room-acoustic parameters, such as reverberation time (T) or speech clarity (C50), can be estimated. Once the building is constructed, Room Impulse Responses (RIRs) can be measured to verify the expected room acoustic parameters. At this point it is, however, difficult to make significant adjustments. Hence, it would be beneficial for an acoustician to listen to the room prior to construction and thus begin adjustments earlier. This can be achieved through auralisation, which is the process of making measured or simulated sound fields audible by reproducing them via headphones or loudspeakers. An early overview can be found in [2]. An additional motivation is that the addition of real-world audio to purely visual virtual reality (VR), compared to no audio and stereo audio only, could increase realism and presence when using VR in experiments or clinical applications [3]. However, before it can be assumed that the VR experiment reflects reality, it is important to evaluate and validate head-tracked binaural auralisations in VR.

Auralisation can be accomplished by either measuring RIRs in an existing room or by simulating the acoustics of the designed room. An RIR allows the acoustician to listen to the room response when an anechoic stimulus is convolved with it, which is a useful first step, but a far cry from actual perception in the real room, since the result lacks spatial cues. To get closer to reality, binaural measurements can be carried out with a Head-and-Torso simulator (HATS), or by simulating rooms with a set of head-related transfer functions (HRTFs). Both approaches yield binaural room impulse responses (BRIRs). In simulations, BRIRs are calculated by applying appropriate HRTFs to simulated direct sound and room reflections reaching the listener from various angles. BRIRs can be simulated, or measured, for different source and receiver positions, as well as for various head orientations, so that the listener can in principle move freely within the room.

A reduced effort could be to simulate BRIRs instead of measuring them. This would also be a step closer to the overall goal of listening to the room before it is constructed. Hence, there is a need for the room-acoustic simulation to be close to reality. To test the performance of room-acoustic simulation tools, room-acoustic scenarios can be defined to challenge the simulation and compare the results with measurements. Previous studies comparing auralisations based on measurements with those based on simulations have typically assumed the measurement to be the more realistic condition. There, a simulation is considered optimal if participants were unable to distinguish between measurement and simulation [4–7]. Lokki and Pulkki [4] compared measured recordings of a loudspeaker in a real room with a simulated loudspeaker in a room for both binaural and monophonic auralisations using the ABX method with an interval scale. They found that, depending on the stimulus, the auralisation was almost indistinguishable from the measurement. This was also found by [5], who compared a computer simulation, a dummy head recording, and measured BRIRs. Postma and Katz [6] compared measured binaural auralisations with simulated ones produced by calibrating the geometrical acoustics model in terms of reverberation and clarity measures based on omnidirectional measurements. Tommasini et al. [7] developed and evaluated a real-time auditory virtual environment, investigating system performance in terms of both objective measures and perceptual attributes. They showed that the total system latency was below detectability. Untrained subjects rated the auralisations, in terms of the attributes metallic tone, localisability and level of reverberation taken from the Spatial Audio Quality Inventory (SAQI) [8], as comparable to the BRIR measurement in the room. All these studies have in common that the comparison is made with respect to a measurement, using different tools to simulate a room. Additionally, head-tracking was not used during the listening tests.

A more comprehensive comparison would be a round-robin test, where the results of different room-acoustic simulation tools can be compared for pre-defined scenes or challenges. An overview of the comparison of room-acoustic simulation with reality can be found in [9]. Besides the overview and the design of a database for the evaluation and validation of room-acoustic simulations, a round-robin for uninformed simulations was carried out, which showed that the simulations mostly deviate from the measurement, possibly caused by the shortcomings of the simulation tools as well as the limited input data, e.g., absorption coefficients of the walls, see also [10] for further details. As there were clearly audible differences between measured and simulated impulse responses, the authors stated that plausible, rather than authentic, auralisations were achieved. In addition, an informed simulation was performed for one of the simulation tools, where the acoustic details were known and the simulation could be tuned. Good agreement with the measurements was found, especially for a small seminar room and an auditorium. Deviations did occur, but were hardly noticeable in informal listening tests.

Measurements, however, have inherent inaccuracies of their own, and therefore a more accurate comparison would involve both auralisations and real loudspeaker playback in an actual room. In addition, as head-tracking has been shown to improve auralisation [11], most of the following studies provided the opportunity to use head tracking.

Recent paradigms comparing dynamic auralisations with a real loudspeaker are authenticity [12], plausibility [13], and transfer plausibility [14]. Brinkmann et al. [12] measured individual BRIRs for different head orientations, room conditions and loudspeaker positions. Then, using speech and pulsed pink noise stimuli, the authors investigated whether expert listeners could discriminate between reality and simulation in an ABX test paradigm. They found that perceptually authentic auralisation is possible for speech stimuli. Measuring such individual BRIRs is very time-consuming, and it is not even possible to measure individual BRIRs when the room to be auralised does not yet exist. It is therefore interesting to see to what extent it is possible to reduce measurement effort, potentially reducing accuracy, and still achieve a good perceived match between reality and simulation.

Lindau and Weinzierl [13] designed a paradigm to assess the plausibility of a virtual source compared to a real source using a yes/no paradigm. They compared measured HATS BRIRs for five loudspeaker positions with the real loudspeaker reproduction. They found that plausibility can be assumed for dynamic binaural auralisations and that plausibility is not as sensitive to colouration deviations as authenticity. Thus, plausible dynamic auralisations can be achieved using generic HRTFs. Further investigations towards plausible or authentic auralisations were conducted under static and anechoic conditions in [15], which focused on source and receiver properties, but did not consider room acoustical effects or the influence of head movements. Lübeck and Pörschmann [16] showed that for a comparable study design in an anechoic room, but using dynamic auralisations with six degrees of freedom, plausibility was achieved even with non-individual HRTFs. Pike et al. [17] evaluated this paradigm for a generic HATS in a small real room and found slightly lower plausibility than [13]. For a broader overview of the challenge of implementing auditory augmented reality, see [18].

A next step would be to compare simulated BRIRs with measured BRIRs and a real loudspeaker. To test the effectiveness of Interactive Virtual Environments (IVEs) in auralisations, Stärz et al. [19] conducted a plausibility experiment, based on Lindau and Weinzierl [13], comparing head-tracked binaural auralisations using both measurements and simulations to a real loudspeaker placed in the actual room that was simulated. Participants responded to these presentations with a yes-no response to the question as to whether the presentation was a real loudspeaker or not. Provided that the actual loudspeaker directivity was included in the room acoustic simulation, a high plausibility was achieved for both measurements and simulations, even for auralisations using generic HRTFs. This can be interpreted as participants believing that the auralisation is consistent with the heard and seen scenario, even when slight variations in reverberation, tone colour, or other room acoustic attributes may have been present in the auralisations. As this was a yes-no task, such plausibility tests do not reveal why different auralisations are judged to be less plausible, and whether differences exist, or how large they are when comparing different auralisations.

Blau et al. [20] conducted a listening test using head-tracked binaural auralisations in a classroom scenario, where auralisations were rated close to reality when directly compared to a loudspeaker in the actual room. In that case, the room simulation was tuned based on a monaurally measured RIR. Interestingly, no advantage was found for individual over HATS BRIRs. Since the participants were in the room being investigated, and visual information was consistent with the auralisation, there was no cross-modal mismatch in acoustic and visual cues. It has been argued that this is important, and inconsistent visual cues may impair the auralisation [21]. Therefore, it is an open question as to whether this perceived realism can be achieved in a room that differs from the original setup, and whether visualisation via a Head-Mounted Display (HMD) can serve as a substitute for the visual information of the real room and still elicit the same perceived realism. Before this can be investigated, however, it is necessary to test whether auralisations can achieve comparable ratings of room acoustic parameters when the visualisation is produced with an HMD instead of seeing the real room.

To illustrate the relevance of the issue outlined above, consider a typical scenario for auralisations prior to construction: an acoustician wanting to listen to the auralisation while being in an office-like environment. This introduces the challenge of the room-divergence effect, where the auralisation can suffer from mismatches between what is being auralised and the room in which the listener is listening (the presentation room) [21]. To mitigate this effect, audiovisual IVEs present a promising solution, utilising a suitable combination of visual room models and auralisations. IVEs provide highly controlled, adjustable conditions that are immersive and can closely approximate reality. Immersion, defined as the technical aspect of realism, is expected to increase when plausible auralisations are employed [22]. However, even if visual room models are well made, the scenario may still deviate from the original scene, potentially leading to the room-divergence effect. Roßkopf et al. [23] demonstrated that visual presentation influences acoustic judgments. Furthermore, in a later study employing a visual virtual scene presented via an HMD, participants' distance estimates of both virtual and real sound sources were distorted [3]. Therefore, interactions between acoustic and visual information are to be expected.

The present study was intended to address this issue by conducting a multi-stimulus rating experiment. The aim was to determine whether differences in the ratings of room acoustics are introduced by different auralisation procedures in an audiovisual virtual environment with HMD and headphones. Auralisations may be based on measurements or simulations, and may involve individual or generic HRTFs. The underlying research question was: In an audiovisual presentation of a room where the room is visually presented via an HMD, is it possible to achieve ratings comparable to those of a real loudspeaker when using auralisations with varying levels of detail?

To understand the requirements for perceptually accurate room auralisation, an existing classroom was used in this study. This allowed comparing the auralisations with a real loudspeaker source in the room, as well as investigating the influence of using individual HRTFs, the effect of wearing an HMD, and the accuracy that can be achieved in simulating the room acoustics.

We also introduced well-defined manipulations of reverberation time and distance to the simulations, to be able to compare observed differences in ratings for the conditions under investigation against differences occurring due to well controllable and interpretable changes in acoustic parameters.

The paradigm of the current study was initially introduced in [1]. Here, additional attributes, such as plausibility and externalisation, were added. In addition, in [1] we observed deviations in the tone colour of auralisations using BRIRs measured with a HATS, compared to listening to the real loudspeaker. To investigate this more closely, we include a dataset of BRIRs measured with one human subject for a wide range of head orientations. The BRIRs of this one human subject effectively substituted for the BRIRs measured with the HATS in our experiment, aiming to determine whether the observed difference in tone colour can be attributed to factors such as the absorption by hair and clothing. The various conditions used in the experiments presented here should provide more insight into the question: In an audiovisual presentation with headphones and HMD, can non-individual auralisations based on (BRIR/HRIR) measurements be considered a valid reference condition when a comparison with a real source is not possible?

Additionally, we address the influence of Head-Worn Devices (HWDs). The passive impact of different open headphones has been demonstrated in terms of colouration and localisation [24–26]. Various circumaural headphones and one supra-aural headphone exhibited a maximum distortion in the interaural level difference of more than 20 dB. Additionally, a reduction in localisation precision was observed [24]. Schneiderwind et al. [25] also examined extra-aural headphones, which attenuate less than circumaural headphones. However, even when headphones are designed to be as acoustically transparent as possible, this goal has not yet been fully achieved. One proposed solution is to use BRIRs that account for the passive influence of the headphones. Besides using different headphone types, Lladó et al. [26] additionally included a worn HMD. Compared to the headphones, the HMD had the least effect on colouration and no effect on localisation. As the passive influence of HWDs is likely to impact colouration and localisation, we investigated whether this can also be seen when the task is to rate room acoustical attributes. To this end, all but one of the BRIR sets were measured, or simulated from HRTF measurements, with an HMD headphone combination, to evaluate whether the one condition without HMD led to different ratings.

2 Methods

2.1 Room under investigation

The room under investigation is a small lecture room (7.12 m × 11.94 m × 2.98 m) at Jade Hochschule in Oldenburg, Germany. The room comprises a window front with open curtains, three plastered brick walls, an acoustically optimised ceiling, and a linoleum floor. We placed a loudspeaker (Genelec 8030b, Genelec Oy, Iisalmi, Finland) in front of the participant at a distance of 4.3 m and a height of 1.6 m. The listener was positioned centrally in the room but slightly off-axis, with an ear height of 1.3 m.

2.2 Interactive Virtual Environment

To present the audiovisual stimuli, we used an interactive virtual environment by combining an HMD (HTC VIVE Pro Eye, HTC Corporation, Xindian, New Taipei, Taiwan) and headphones (AKG K1000, AKG Acoustics GmbH, Vienna, Austria) connected via an aluminum support described by Stärz et al. [27]. Unreal Engine v.4.27.2 (Digital Extremes, London, Ontario, Canada and Epic Games, Inc., Raleigh, NC, USA) was used to present a virtual room model of the room described above. The virtual room had previously been created in 3Ds Max (Autodesk, Inc., San Rafael, CA, USA) by architecture students at Jade Hochschule Oldenburg. The comparison is visualised in Figure 1. For the interaction in VR, we developed a graphical user interface (GUI) with sliders, play buttons, a button to order the sliders and some additional buttons to navigate through the listening test, see Figure 2. Prior to the listening test, participants were given an introduction to how to interact within the IVE.

Figure 1

Room under investigation. Top: A HATS at the listening position within the room. Bottom: The virtual model of the identical room, displayed through a HMD during the listening test.

Figure 2

Graphical user interface for participants to interact in VR using sliders to rate different auralisations for different attributes. In this case, the attribute being rated is reproduction quality. Wiedergabequalität = quality of the reproduction; Hoch = high; Gering = low; Ordnen = order.

2.3 Auralisation details

An overview of the auralisations used during the listening test is shown in Figure 3. Three main parts can be distinguished: the real room as a given base (cf. Sect. 2.1), a set of precomputed data calculated before the listening test, and the real-time rendering during the listening test.

Figure 3

A block diagram providing an overview of the steps involved in realising the head-tracked binaural auralisations. Open headphones were used to allow comparison with a real loudspeaker in the real room.

2.3.1 Real room

The real room served as the place where the listening test took place, but was also the target for the auralisations. Several pieces of audio and visual raw data, such as room dimensions or monaural RIRs, were extracted to simulate or measure the room, as well as to build a room model that could be presented through the HMD. The real room is shown as the starting point of the block diagram in Figure 3.

2.3.2 Pre-computed data

The pre-computed BRIR sets used in this study were based either on measurements in the real room or on a room simulation. An overview of the BRIR sets used is given in Table 1.

Table 1

Overview of used BRIR sets.

Measured BRIRs

A total of three BRIR sets were measured in the real room. In two of them, a HATS (KEMAR type 45BB, GRAS Sound and Vibration A/S, Holte, Denmark) was used. MEMS microphones (TDK type ICS-40619, TDK InvenSense, San Jose, CA, USA) were used together with PIRATE earplugs [28] and placed in the ear canal, creating a blocked-ear condition.

The BRIR measurement in the real room, as part of the precomputed data block shown in Figure 3, was performed by measuring 37 azimuthal head orientations ranging from −90° to 90°, resulting in a resolution of 5°. Only the head was rotated in this measurement. For the first measurement, called “meas:HATS woHMD” in the following sections, we used the HATS without further modification. The second BRIR set, called “meas:HATS” is similar, but the HATS was wearing the HMD headphone combination presented in Section 2.2. Additionally, a measurement “meas:Human” was performed with a single human subject replacing the HATS. This measurement was also taken with the HMD headphone combination. As this measurement was extremely tedious for the human participant, the measurement points were reduced to 25 azimuthal head orientations ranging from −60° to 60° in 5° steps. To overcome the limitation of head orientations, participants were instructed to keep the loudspeaker in their field of view during the listening test. As the manufacturer specifies the HMD field of view as 110°, the limits of ±60° were not exceeded.

Simulated BRIRs

In addition, we used simulated BRIRs obtained by using a room-acoustic simulation tool together with Head Related Impulse Responses (HRIRs). These simulated BRIRs are also part of the pre-computed data block of Figure 3. We used the room-acoustic simulator RAZR (version 0.962b) [29, 30] for simulating BRIRs. RAZR was used with a 3rd order Image Source Model (ISM) [31], followed by a Feedback Delay Network (FDN) [32]. In RAZR, the actual room geometry was approximated by a shoebox-shaped one. Additionally, the frequency-dependent absorption coefficients per wall were estimated based on the respective wall materials. As this is a rough approximation, monaural RIRs were measured, averaged over a total of 25 source-receiver combinations, and the room absorption coefficients in RAZR were adjusted to match the frequency-dependent reverberation time for this monaural RIR, as suggested by Blau et al. [20]. We also used a loudspeaker directivity database, as well as measured HRTF databases, for the HATS and for individual participants. In the room-acoustic simulation, both the HRTFs and the directivity of the loudspeaker were selected using the nearest-neighbour method.
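The nearest-neighbour selection of HRTFs and loudspeaker directivity mentioned above can be sketched as follows: each stored direction is compared with the requested direction of incidence via the dot product of their unit vectors, and the closest one is used. The 5° horizontal-plane grid below is a hypothetical example, not the actual layout of the databases used in RAZR.

```python
import math

def sph_to_cart(az_deg, el_deg):
    # Convert azimuth/elevation in degrees to a Cartesian unit vector.
    az, el = math.radians(az_deg), math.radians(el_deg)
    return (math.cos(el) * math.cos(az),
            math.cos(el) * math.sin(az),
            math.sin(el))

def nearest_direction(target, measured):
    # Index of the measured direction with the smallest angle to the
    # target, i.e. the greatest dot product between the unit vectors.
    t = sph_to_cart(*target)
    dots = [sum(a * b for a, b in zip(t, sph_to_cart(*d))) for d in measured]
    return dots.index(max(dots))

# Hypothetical 5-degree azimuth grid in the horizontal plane:
grid = [(az, 0.0) for az in range(-90, 95, 5)]
idx = nearest_direction((13.0, 2.0), grid)  # picks the 15-degree entry
```

In a full implementation the candidate set would cover the whole sampled sphere; the dot-product criterion then still applies unchanged.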

HRTF database

HRTFs were obtained using a custom-built measurement setup with 24 active loudspeakers (Speedlink type SL-8902-GY Xilu, Jöllenbeck GmbH, Weertzen, Germany). These loudspeakers were mounted on an aluminum circular arc measuring 240° with a radius of 1.25 m. Similar to the BRIR measurement, MEMS microphones were placed to obtain a blocked-ear condition. A sweep signal using the modified multiple exponential sweep method [33, 34] was employed, allowing 24 unique HRIRs to be measured in one measurement. As the arc is symmetrical, twelve different elevation angles (−30 to 30° in 7.5° steps and 45°, 60°, and 75°) for two azimuthal directions were measured simultaneously. The measurement arc could then be rotated to discretely sample the spherical surface around the participant. We used an azimuthal resolution of 5°, resulting in 37 measurement runs. In total, the entire HRIR set contains 864 directions of incidence. As the entire measurement took 25 min, we integrated a head-tracking procedure that helped participants keep their head and body fixed during the measurement. Apart from head tracking, the HRTF procedure was the same for both HATS and individual participants.

Due to the low-frequency limitation of the speakers, the measured HRIR data is only valid down to 350 Hz. To compensate for this, we fitted a spherical head model taken from AKtools [35] and used a spline interpolation in the frequency region between 250 Hz and 450 Hz to extrapolate the magnitude. To estimate an optimal spherical-head radius, the height, width and length of each participant's head were measured and the optimal head radius was estimated as a weighted sum using the method and weights proposed by Algazi et al. [36]. The HRIR data were stored in the Spatially Oriented Format for Acoustics (SOFA) [37] and later used to simulate BRIRs with RAZR.
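The low-frequency extension step can be illustrated with the sketch below: the magnitude below the lower bound is replaced by the head-model prediction, and the gap between the two bounds is bridged by interpolation. For simplicity the sketch uses linear interpolation instead of the spline employed in the study, and the function name and test frequencies are illustrative only.

```python
import numpy as np

def extend_low_freq(freqs, meas_db, model_db, f_lo=250.0, f_hi=450.0):
    # Below f_lo the measurement is invalid: use the head-model prediction.
    # Between f_lo and f_hi, bridge model and measurement by interpolation
    # (the study used a spline; plain linear interpolation is shown here).
    out = np.asarray(meas_db, dtype=float).copy()
    out[freqs < f_lo] = model_db[freqs < f_lo]
    gap = (freqs >= f_lo) & (freqs <= f_hi)
    anchors = ~gap  # fit through the model (low) and measured (high) parts
    out[gap] = np.interp(freqs[gap], freqs[anchors], out[anchors])
    return out
```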

Artificial modifications

The last three BRIR sets were based on the simulation described above, but were manipulated so that the listener would be able to perceive a difference in some attributes. This was done for two reasons. Firstly, it showed whether participants were able to hear and rate the modifications, and therefore whether the test method was working. Secondly, it also helped to interpret the differences for the other BRIR sets compared to the hidden reference, the real loudspeaker.

First, we moved the source position 2 m closer to the receiver compared to the initial simulation. With this modification, we introduced a perceivable distance change that was also expected to affect colouration and loudness. This set will later be referred to as “sim:Distance”. Second, we changed the underlying room absorption coefficients to increase the perceivable reverberation. To this end, the initial room absorption coefficients were reduced to 85% of their original value, resulting in an 18% higher reverberation time. This was expected to mainly change the perception of reverberation, but might also have influenced loudness, colouration, and distance. This BRIR set is called “sim:Wet”. The last BRIR set, referred to as “anchor”, was designed to manipulate perceived externalisation. For this set, we simulated BRIRs and used the left-ear signal for the right ear as well, i.e. no interaural level or time differences remained. This also completely removed the effect of the head tracking, leading to a static representation that differed strongly from all other BRIR sets. To counteract this, we added the frequency-independent amplitude panning described by Sengpiel [38], previously used as an anchor condition by Klein et al. [39]: based on the head orientation, the impulse responses for the left and right ear were multiplied by a factor, such that when the head was rotated completely to the left, the difference between both ears was 18 dB, with the louder sound at the right ear. This BRIR set was designed to be an anchor for externalisation, but plausibility could also have been affected by this manipulation.
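A minimal sketch of such an orientation-dependent panning rule is shown below. It assumes the level difference grows linearly with head rotation and reaches the stated 18 dB at an assumed rotation limit of ±60°; the exact mapping used in the study may differ, so treat the constants as illustrative.

```python
import math

MAX_DIFF_DB = 18.0   # level difference at full head rotation (from the text)
MAX_AZ_DEG = 60.0    # assumed rotation limit of the listening test

def panning_gains(head_az_deg):
    # Frequency-independent gains (left, right) as linear factors.
    # The level difference is assumed to grow linearly with rotation;
    # turning the head left (negative azimuth) makes the right channel
    # louder, and vice versa.
    frac = max(-1.0, min(1.0, head_az_deg / MAX_AZ_DEG))
    diff_db = MAX_DIFF_DB * frac          # left level minus right level
    g_left = 10.0 ** (diff_db / 40.0)     # split the difference symmetrically
    g_right = 10.0 ** (-diff_db / 40.0)   # between the two channels
    return g_left, g_right
```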

The resulting A-weighted third-octave band sound-pressure levels and reverberation time T20 for all the BRIRs investigated are shown in Figure 4.

Figure 4

Objective data derived from all BRIRs used in this study. Left: reverberation time (T20) expected to be the same for all BRIR sets except for sim:Wet; Right: A-weighted third-octave band sound pressure levels after convolving the speech stimulus used in the listening test with the BRIRs.

Headphone equalisation

Individual headphone impulse responses were measured immediately prior to the start of the listening test, and a headphone equalisation filter (HPEQ) was derived by regularised inversion [40]. The headphones were not moved afterwards, which should have eliminated the audible influence of repositioning the headphones [41]. The HPEQ was then applied to all precomputed BRIRs.
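The basic form of regularised inversion can be sketched as follows, where the regularisation constant limits the inverse gain at deep notches in the headphone response. This is a simplified sketch with a frequency-independent constant; the method of [40] uses a more elaborate frequency-dependent regularisation.

```python
import numpy as np

def hpeq(hp_ir, beta=1e-2, n_fft=256):
    # Regularised inversion: H_eq = conj(H) / (|H|^2 + beta).
    # beta > 0 bounds the gain where |H| is small (deep notches).
    H = np.fft.rfft(hp_ir, n_fft)
    H_eq = np.conj(H) / (np.abs(H) ** 2 + beta)
    h_eq = np.fft.irfft(H_eq, n_fft)
    return np.roll(h_eq, n_fft // 2)  # circular shift makes the filter causal
```

For a perfectly flat headphone response (a unit impulse), the inverse is simply a scaled impulse with gain 1/(1 + beta), which illustrates the small level error the regularisation trades for robustness.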

Real Time Rendering

The real-time rendering block in Figure 3 represents the actual events during the listening test. For visualisation, the previously designed visual room model was played through the HMD using the Unreal game engine. In addition to running the user interface, Unreal also provided head-tracking data and Open Sound Control (OSC) commands for the audio real-time rendering. The headphone audio was computed using a time-variant partitioned convolution algorithm based on Jäger et al. [42, 43]. This algorithm selected a BRIR based on the current head orientation and convolved it with an anechoic stimulus in real time. The result was head-tracked binaural auralisation over headphones.
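Conceptually, the renderer repeats two steps per audio block: pick the BRIR closest to the current head azimuth and convolve the stimulus block with it. The sketch below uses a naive full convolution with overlap-add carry-over; a real-time system like that of [42, 43] would instead use time-variant partitioned convolution with cross-fading between BRIRs. All names are illustrative.

```python
import numpy as np

def select_brir(brirs, azimuths_deg, head_az_deg):
    # Nearest-neighbour choice on the measured azimuth grid.
    i = int(np.argmin(np.abs(np.asarray(azimuths_deg) - head_az_deg)))
    return brirs[i]

def render_block(block, brir, tail):
    # Convolve one audio block with the currently selected BRIR and
    # carry the convolution tail into the next block (overlap-add).
    y = np.convolve(block, brir)
    y[: len(tail)] += tail
    return y[: len(block)], y[len(block):]
```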

An important factor in IVEs is the latency of the audiovisual presentation. For the audio latency, Meyer-Kahlen et al. [44] described two methods for measuring the “motion-to-data-output” latency, i.e. the time from the movement to the time when the data is transmitted to the audio rendering software, and the “motion-to-sound” latency, i.e. the time from the head movement to the first audio sample. We used the method based on an impulsive movement. The measurement was repeated ten times, resulting in a median “motion-to-data-output” latency of 17.56 ms (M = 18.24 ms, SD = 2.55 ms) and a median “motion-to-sound” latency of 38.88 ms (M = 37.01 ms, SD = 5.88 ms). The resulting “motion-to-sound” latency is unlikely to be detectable, assuming the threshold of 60 ms reported for most applications by Brungart et al. [45].

2.4 Listening test

In total, 21 participants took part in the listening test (4 female, 17 male, with a median age of 28 years). The participants were assumed to be expert listeners in the sense that they work or study in the field of hearing research, are familiar with listening tests, and have mostly heard auralisations before. Written informed consent was obtained from all participants, and an hourly fee of 12 euros was paid. The study was approved by the local ethics committee (University of Oldenburg) and complied with the tenets of the Declaration of Helsinki.

Participants had to rate the auralisations presented in Section 2.3 in a multi-stimulus rating paradigm. In total, there were eight different auralisation conditions and one hidden reference, i.e. the real loudspeaker in the real room. Participants were informed in writing that they were allowed to rotate their head within a range of ±60° in azimuth. Seven different attributes, partly taken from the Room Acoustical Quality Inventory (RAQI) [46], were rated. The RAQI attributes to be rated, with the corresponding poles, were reverberance (dry – reverberant), tone colour (dark – bright), loudness (soft – loud) and source distance (near – optimal – far). In addition, we added reproduction quality (low – high), plausibility (not plausible – plausible) and externalisation (internalised – externalised). Participants rated on a scale from 0 to 100, divided into steps of 10, resulting in 11 different rating options per attribute, except for plausibility and externalisation. These were rated without intermediate steps; only the respective poles could be used. To gain a better understanding of what to evaluate, participants were first given a written introduction to some of the attributes. For instance, it was explained that an optimal source distance is based on a comparison with the visual representation: if the visual representation matches the auditory representation, the source distance is optimal. The “quality of the reproduction” was defined as the quality according to personal preference. The listening test started with a familiarisation phase during which participants listened for at least 30 s per stimulus, but were able to switch between all auralisations and the real loudspeaker without further information as to whether they were listening to the real loudspeaker or the auralisations. For each auralisation method, a total of three different source signals was used: a female voice, a guitar, and a saxophone stimulus. To obtain an overview of the variety offered by the different auralisations, participants were free to switch between the auralisations for each source sound separately.

During the main part of the listening test, a male speech stimulus [47], normalised to EBU R 128 [48] using the “integratedLoudness” function from the Matlab Audio Toolbox [49], was used. Participants were able to restart the playback as often as they wished, and were able to take as much time as they needed to rate the different auralisations. For an easier comparison, participants could rearrange the sliders based on the current score, thus allowing faster and more direct comparisons of neighbouring sliders, as suggested by Chevret and Parizet [50].
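Once the integrated loudness of the stimulus has been measured, the normalisation itself amounts to computing a gain towards the EBU R 128 target of −23 LUFS, e.g.:

```python
def r128_gain(measured_lufs, target_lufs=-23.0):
    # Linear gain that brings a programme measured at `measured_lufs`
    # to the EBU R 128 target level.  The loudness measurement itself
    # (done in the study with Matlab's integratedLoudness, a gated
    # K-weighted power average) is not reproduced here.
    return 10.0 ** ((target_lufs - measured_lufs) / 20.0)

# Example: a stimulus measured at -18.5 LUFS must be attenuated by 4.5 dB.
g = r128_gain(-18.5)
```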

Each condition presented was rated twice for each attribute in two blocks, with each block consisting of all seven attributes. In each block, the reproduction quality attribute was rated first. The other six attributes were randomised within each block. After the two repeated blocks, reproduction quality was rated a third time, followed by an interview in which the participants could comment on their experience. Each auralisation could be listened to again. Typical questions were “Why did you rate auralisation X as best?” or “Please compare auralisations A and B again.” The aim of this interview was to get a better understanding of why and how the participants rated each auralisation the way they did. It also helped to get a better understanding of each participant's rating behaviour. This third rating was excluded from further analysis.

3 Results

Figure 5 shows the raw data for each participant, averaged over the two repeated presentations. Statistical analysis was conducted using the R environment [51] with the packages “psych” [52] and “ez” [53]. The data were tested for normality using the Shapiro-Wilk test [54, 55]. With the exception of the source distance attribute, a normal distribution can be assumed. Therefore, repeated measures ANOVA followed by Bonferroni-Holm [56] corrected paired t-tests were performed.
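
The authoritative analysis was done in R; as a minimal illustration of the same pipeline, the sketch below uses scipy (`shapiro`, `ttest_rel`) and a hand-rolled Holm-Bonferroni step-down correction on entirely synthetic ratings (the data and effect sizes are invented for the example):

```python
import numpy as np
from scipy.stats import shapiro, ttest_rel

def holm_correct(pvals):
    """Holm-Bonferroni step-down correction; returns adjusted p-values."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    adj = np.empty(m)
    running_max = 0.0
    for rank, idx in enumerate(order):
        running_max = max(running_max, min(1.0, (m - rank) * p[idx]))
        adj[idx] = running_max  # enforce monotone non-decreasing adjustment
    return adj

# synthetic ratings: 21 participants, real loudspeaker vs. two auralisations
rng = np.random.default_rng(7)
real = rng.normal(70, 10, 21)
aural_a = real + rng.normal(0, 5, 21)    # close to the reference
aural_b = real + rng.normal(-15, 5, 21)  # clearly rated lower

_, p_norm = shapiro(real - aural_a)      # normality check on the differences
raw = [ttest_rel(real, c).pvalue for c in (aural_a, aural_b)]
corrected = holm_correct(raw)
```

Holm's procedure multiplies the smallest p-value by the number of tests, the next by one less, and so on, while keeping the adjusted values non-decreasing in rank order.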

Figure 5

Box plots showing the ratings averaged over the repeated measurements as a function of the BRIRs for each rated room-acoustic attribute. Statistically significant differences (repeated measures ANOVA, for source distance non-parametric Wilcoxon signed-rank test, both corrected by Bonferroni-Holm) are presented, starting from the dot and referring to each downward tick on the respective line. The corresponding significance asterisk is placed below the line near the downward tick.

For source distance, the analysis was carried out using Matlab. The Friedman test [57, 58] was used, followed by the non-parametric Wilcoxon signed-rank test [58, 59] with Bonferroni-Holm correction [56].
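
The non-parametric pipeline for source distance can be sketched analogously; the snippet below uses scipy's `friedmanchisquare` and `wilcoxon` on synthetic ratings (the numbers are illustrative, not the study's data, and the pairwise p-values would subsequently be Holm-corrected as in the parametric case):

```python
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

# synthetic distance ratings: 21 participants x 3 conditions (illustrative only)
rng = np.random.default_rng(3)
ratings = rng.normal(50, 10, size=(21, 3))
ratings[:, 2] -= 30  # one condition rated clearly closer

# omnibus test across the repeated-measures conditions
stat, p_friedman = friedmanchisquare(*ratings.T)

# pairwise follow-up: Wilcoxon signed-rank tests for all condition pairs
pairs = [(0, 1), (0, 2), (1, 2)]
p_pairwise = [wilcoxon(ratings[:, i], ratings[:, j]).pvalue for i, j in pairs]
```

The Friedman test serves as the non-parametric counterpart of the repeated measures ANOVA; only if it rejects the null hypothesis of equal ratings are the pairwise signed-rank tests interpreted.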

As explained in Section 2.3.2, eight different BRIR sets were used: three based on measured BRIRs with different head-above-torso orientations (meas:HATS woHMD, meas:HATS, meas:Human), four based on simulations using either measured HATS or individual HRTFs, where the latter were additionally modified with respect to reverberation and source distance (sim:HATS, sim:Indiv, sim:Wet, sim:Distance), and one anchor.

The repeated measures ANOVA showed a significant effect of BRIRs for the attribute reproduction quality (F(8, 160) = 36.29, p < 0.001, ηp² = .64). Reproduction quality was rated with a median of 70 for the real loudspeaker. The majority of the auralisations were rated similarly, with the exception of the “meas:Human”, “sim:Wet”, and “anchor” conditions. It was observed that participants rarely used scores of 90 and above. Some participants reported that they did not intend to use the full range, and this can also be seen for the lower scores.

For reverberance, the repeated measures ANOVA showed a significant effect of BRIRs, F(8, 160) = 41.86, p < 0.001, ηp² = .68. The real loudspeaker was rated with a median of 55. Humanised BRIRs such as “meas:Human” and “sim:Indiv” showed a tendency to be rated drier, with the highest deviation from the real loudspeaker for “sim:Indiv” (t[20] = 3.5, p = 0.038, d = 0.76). The manipulated conditions were rated as expected: the auralisation with increased reverberation was rated wetter than the loudspeaker (t[20] = −7.04, p < 0.001, d = −1.54), and the auralisation with decreased distance was rated drier (t[20] = 4.68, p = 0.001, d = 1.02). The highest (i.e. wettest) score was reached for the “anchor” condition (t[20] = 9.54, p < 0.001, d = 2.08).

The repeated measures ANOVA showed a significant main effect of BRIRs for the attribute tone colour (F(2.99, 59.80) = 17.90, p < 0.001, ηp² = .47). The loudspeaker was rated with a median score of 50. BRIRs based on generic HRTFs, such as “meas:HATS” with and without HMD as well as “sim:HATS”, tended to be rated brighter than the real loudspeaker. This is particularly pronounced for the measured BRIRs using the HATS HRTFs, which were rated as brightest (t = −6.75, p < 0.001, d = −1.22). The scores of the “humanised” and simulated BRIRs were comparable to the real loudspeaker, with “sim:Indiv” rated darker. The “anchor” condition was rated as darkest.

For loudness, a significant effect of BRIRs was found (F(4.79, 95.88) = 36.68, p < 0.001, ηp² = .65). All conditions were rated as loud as the real loudspeaker except for “sim:Distance” and the “anchor” condition. Participants rated “sim:Distance” as significantly louder than the real loudspeaker (t[20] = −6.54, p < 0.001, d = −1.43), and the “anchor” as softer (t[20] = −7.54, p < 0.001, d = −1.64). The louder “sim:Distance” can be explained by the manipulation procedure: the loudspeaker was moved 2 m closer to the listener in the room acoustic simulation without adapting the loudness afterwards.

The Friedman test indicated that the null hypothesis of equal ratings for BRIRs should be rejected for source distance (p < 0.001). The majority of the auralisation ratings were comparable to those of the real loudspeaker, with a rating slightly more distant than optimal. The manipulated condition “sim:Distance” was perceived as closer than the real loudspeaker (p = 0.0027), which was the target when designing this condition. In contrast, “sim:Wet” tended to be rated as more distant than the loudspeaker. This can be attributed to the higher reverberation, which is likely to cause a sound source to be perceived as further away. The “anchor” condition was rated as closest. This can be explained by the lack of externalisation for this auralisation; participants mostly perceived this condition as internalised, which was then rated as closer than externalised auralisations. However, a large variability can be seen for this auralisation. Only a few participants rated the “anchor” condition as more distant than optimal. An explanation can be found in the interview, in which some of the participants stated that there was an interplay between reverberance and the softer loudness: high reverberation combined with a soft loudness is likely to cause a sound source to be perceived as further away.

In the following, specific aspects of the results are analysed in more detail. Even though only a selection of BRIRs will be discussed at a time, the Bonferroni-Holm corrections always refer to the whole set of BRIRs.

3.1 Getting close to real

Figure 6 shows the data for the BRIRs based on measurements obtained from HATS or from a human, and the BRIRs based on simulations using either HATS HRTFs or individual HRTFs, each together with the real loudspeaker. In other words, all BRIRs except the intended manipulations and the “anchor” are included. This allows multiple comparisons to be made, such as loudspeaker versus virtual, measurement versus simulation, HATS BRIRs versus individual or humanised BRIRs.

Figure 6

Box plots showing ratings averaged over the two repeated measurements as a function of room acoustic attributes. Generic BRIR data are compared to an individual or human dataset for both measured and simulated BRIRs with respect to the real loudspeaker rating. Significant differences are shown from the dot to each downward tick on the respective line. This data is a subset of Figure 5, and the Bonferroni-Holm correction has been applied to the whole data set.

3.1.1 Loudspeaker versus virtual

Significant differences between the auralisations and the real loudspeaker were found for tone colour in the case of “meas:HATS” and for reverberance in the case of “sim:Indiv”. “meas:HATS” was perceived as brighter (t[20] = −6.75, p < 0.001, d = −1.47), “sim:Indiv” as drier (t[20] = 3.5, p = 0.038, d = 0.76). All other combinations of BRIR set and attribute were fairly close to the real loudspeaker. It can be said that most of the underlying auralisations were well suited to providing an experience comparable to the loudspeaker. Nevertheless, “sim:Indiv” was perceived as drier and tended, albeit not significantly, to be darker and closer than the real loudspeaker.

3.1.2 Measured versus simulated BRIRs

A direct comparison of measurement and simulation is possible for generic HATS data only, since there is no measured counterpart for “sim:Indiv” (“meas:Human” refers to one individual person only). It can be seen that the measured BRIRs were rated significantly brighter than the simulated BRIRs (t[20] = 5.4, p < 0.001, d = 1.18). There is also a difference in tone colour between “meas:HATS” and “meas:Human”, which likewise exists between “sim:HATS” and “sim:Indiv”; this will be discussed in detail in Section 3.1.3.

For the measurement versus simulation analysis, the difference between generic HATS data and “humanised” BRIR data is of interest. It can be seen that this difference is slightly smaller for the simulation. Furthermore, when comparing measurement and simulation, it is important to place them in the context of the real-world condition that these methods are intended to reproduce. This echoes Section 3.1.1, where the individual simulated condition tended to be slightly darker. Together with the smaller difference between generic and humanised BRIRs, this suggests that the simulations tended to sound slightly darker than the measurements.

3.1.3 Human versus HATS BRIRs

Two comparisons can be made regarding the differences between BRIRs based on a HATS and BRIRs based on a real person. For the measured BRIRs, it is possible to compare HATS to human; for the simulation, it is possible to compare HATS to individual HRTF data. As the BRIR measurements were not individual, a direct comparison is only possible for the simulated BRIRs. It was observed that participants rated “meas:HATS” as significantly brighter than “meas:Human” (t[20] = 5.8, p < 0.001, d = 1.26). This significant difference was also seen when comparing “sim:HATS” to “sim:Indiv” (t[20] = 3.88, p = 0.019, d = 0.85). Therefore, the generic BRIRs used here were perceived as brighter, regardless of whether they were measured or simulated. It is noteworthy that there was a tendency for the measured generic BRIRs to be rated higher than the humanised BRIRs in terms of reproduction quality, although this was not statistically significant. This difference was barely seen for the simulated BRIRs. It seems that by measuring BRIRs with a human, the increased brightness relative to the real loudspeaker observed with “meas:HATS” could be reduced. However, some influence remains that tends to decrease the reproduction quality. This effect is not found for the simulation when the HRTF data are individual. Some participants said that minor localisation errors had occurred for “meas:Human”, which had led them to rate the reproduction quality lower than they otherwise would have.

3.2 Influence of wearing an HMD

Figure 7 shows the comparison between the measured BRIRs with and without the HMD headphone combination. There are no statistically significant differences between the two. There was a tendency for the reproduction quality to be rated higher with the HMD headphone combination than without. The only attribute that might explain this is loudness, with a tendency for the BRIRs with HMD to be rated softer than those without. However, it is possible that the differences in reproduction quality are not captured by the given attributes. For this reason, participants were asked in the interview to compare these two BRIRs directly. Most participants reported that the stimuli were identical; they could not hear a difference. Participants who could hear a difference found it difficult to describe, and were unable to name an attribute for what they perceived.

Figure 7

Box plots showing ratings averaged over the repeated measurements as a function of room acoustic attributes. BRIRs based on measurements with and without HMD headphone combination. This data is a subset of Figure 5, and the Bonferroni-Holm correction has been applied to the whole data set.

3.3 Influence of artificial modifications

The introduced changes in distance and reverberance served as modifications to help better interpret the ratings given by the listeners. Figure 8 shows the average ratings of the manipulated conditions “sim:Wet” and “sim:Distance” compared to the “sim:Indiv” condition and the real loudspeaker for each attribute. As intended, the “sim:Wet” condition was rated significantly higher in terms of reverberance than all other BRIRs in this comparison. Additionally, “sim:Wet” was perceived as louder (t[20] = 5.14, p = 0.001, d = −1.12) and more distant (p = 0.045) than the “sim:Indiv” BRIRs. These differences may explain why the reproduction quality was also rated lower than for “sim:Indiv” (t[20] = 6.38, p < 0.001, d = 1.39). Furthermore, participants indicated in the interview that the auralisation was too reverberant for the room. The reverberation was perceived as more pronounced than in the other auralisations, and participants were clearly able to perceive its increase. It is noteworthy that they also perceived the loudness increase that accompanied the increase in reverberation.

Figure 8

Box plots showing ratings averaged over the repeated measurements as a function of room acoustic attributes. Introduced manipulations, such as increasing reverberation or decreasing distance, are compared to the real measurement and the simulation without manipulation. Significant differences are shown from the dot to each downward tick on the respective line. This data is a subset of Figure 5, and the Bonferroni-Holm correction has been applied to the whole data set.

Furthermore, participants perceived “sim:Distance” as closer than all other BRIRs in this comparison. The loudness increase resulting from this manipulation was also identified by the participants, who rated it as louder. In comparison to the real loudspeaker, participants rated “sim:Distance” as drier (t[20] = 4.68, p = 0.001, d = 1.02), which is potentially due to the natural side effect that this manipulation increases the direct-to-reverberant ratio. Interestingly, this mismatch in position had no clear effect on reproduction quality: the variance for reproduction quality was comparably high. In the interview, participants' answers differed: some expressed a preference for the more direct, dry, and intelligible sound of “sim:Distance”, while others argued that the distance of the sound was not appropriate for the visible scene.
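
The direct-to-reverberant ratio (DRR) argument can be made concrete with a toy impulse response. The helper `drr_db` below is a hypothetical illustration (not the study's analysis code) using a common simplified convention: energy within a short window after the direct-sound peak counts as direct, the rest as reverberant. Moving a source closer scales up the direct sound while the diffuse tail stays roughly constant, so the DRR rises and the result sounds drier:

```python
import numpy as np

def drr_db(rir, fs, direct_window_ms=2.5):
    """Direct-to-reverberant ratio of an impulse response in dB.

    Simplified convention: energy within direct_window_ms after the
    direct-sound peak counts as direct, the remainder as reverberant.
    """
    rir = np.asarray(rir, dtype=float)
    peak = int(np.argmax(np.abs(rir)))
    split = peak + int(fs * direct_window_ms / 1000.0)
    direct = np.sum(rir[:split] ** 2)
    reverberant = np.sum(rir[split:] ** 2)
    return 10.0 * np.log10(direct / reverberant)

# toy RIR: unit direct sound followed by an exponentially decaying noise tail
fs = 48000
rng = np.random.default_rng(1)
tail = 0.05 * rng.standard_normal(fs // 2) * np.exp(-np.arange(fs // 2) / (0.05 * fs))
rir_far = np.concatenate(([1.0], tail))
rir_near = np.concatenate(([2.0], tail))  # halving the distance roughly doubles direct amplitude
```

With the same tail, `rir_near` yields a higher DRR than `rir_far`, consistent with the drier rating of “sim:Distance”.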

3.4 Externalisation and plausibility

In addition to the room acoustic attributes, externalisation and plausibility were rated. As these attributes were assessed categorically, a Pearson's chi-squared test was performed on the count data. The proportion of stimuli perceived as externalised (χ2(16, N = 21) = 127.77, p < 0.001) and plausible (χ2(16, N = 21) = 113.77, p < 0.001) differed significantly between BRIRs. The highest externalisation scores were obtained for the real loudspeaker and the BRIRs based on measurements (95% or higher). The BRIRs with the decreased distance, “sim:Distance”, were significantly more likely to be perceived as internalised, at about 31% (χ2 = 6.33, p = 0.042). The “anchor” condition was most often perceived as internalised, at about 93% (χ2 = 35.53, p < 0.001). When asked about their reasons for rating stimuli as internalised, some participants said that they were reluctant to rate only 1 out of 9 conditions as internalised, so they occasionally chose to rate more than one as internalised. This may also explain why the real loudspeaker was sometimes reported as internalised. As “sim:Distance” was closer to the listener, participants may not only have perceived this distance manipulation as closer to the head, but also interpreted it as internalised.
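
A chi-squared test on such count data can be sketched with scipy's `chi2_contingency`. The counts below are purely illustrative (not the study's data), arranged as externalised/internalised rows over nine conditions; the study's reported 16 degrees of freedom imply a different table layout than this simplified 2 × 9 example:

```python
import numpy as np
from scipy.stats import chi2_contingency

# hypothetical externalisation counts for 9 conditions, 21 participants each;
# these illustrative numbers are NOT the study's data
counts = np.array([
    [20, 20, 19, 18, 19, 18, 14, 17,  1],  # rated externalised
    [ 1,  1,  2,  3,  2,  3,  7,  4, 20],  # rated internalised
])
chi2, p, dof, expected = chi2_contingency(counts)
```

For a 2 × 9 table the degrees of freedom are (2 − 1)(9 − 1) = 8; a condition whose counts deviate strongly from the pooled proportions, like the last column here, drives the statistic up.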

Figure 9

Bar chart of ratings averaged over the repeated measurements as a function of BRIRs for externalisation and plausibility.

In addition, to determine which auralisations were perceived as real and appropriate to the visual presentation, the concept of plausibility was assessed. It is important to note that plausibility was rated on a binary scale in a multiple comparison, differing from the concept proposed by Lindau and Weinzierl [13]. Participants provided introspective ratings, which were based on their internal references. Consequently, these ratings were influenced by individual interpretations of plausibility. Furthermore, the presentation of other BRIR sets could affect plausibility ratings. A good overview of the different plausibility paradigms and their definitions is found in [60].

The authors previously evaluated most of the underlying BRIRs for plausibility using a yes/no task, as described by [13], and found that most BRIRs were considered plausible [19]. In the current work, a direct, binary rating of plausibility was used. Although this direct rating is expected to depend on participants' internal criteria, the approach still provided insight into the relative degree of perceived plausibility between the measured conditions. To reduce the influence of the BRIR set played before or after, the order of playback was randomised for each participant and each repetition.

Most of the auralisations were rated as plausible, with scores ranging from 80% to 95%. “meas:Human” was more likely to be rated as not plausible (38%) compared to the real loudspeaker (χ2 = 9.79, p < 0.01). Some participants reported a slight difference in azimuthal localisation compared to other auralisations. It is important to note that these BRIRs are not individual but “humanised”. This means that for individual participants, these BRIRs are a generic data set. Typical problems with generic HRTFs, such as localisation errors, may explain the higher number of implausible ratings.

A reduced plausibility can also be seen for “sim:Distance” (χ2 = 8.13, p = 0.017) and “sim:Wet” (χ2 = 8.42, p = 0.015). Most of the other auralisations were similar regarding plausibility, whereas the introduced manipulations created a larger difference. It should be noted that the manipulation of the room acoustic simulation was not so dramatic as to be completely implausible.

All participants rated the “anchor” condition as not plausible. This can be explained by its low externalisation score, as well as the large differences for almost all attributes compared to all other auralisations.

Despite the large number of plausible ratings, a small share of implausible ratings remains for the real loudspeaker (5%), “meas:HATS” (about 15–20%), “sim:HATS”, and “sim:Indiv” (both 10%). When asked, participants reported the same effect as for externalisation: more than one condition was rated as implausible. Mostly the conditions with introduced manipulations, as well as the “anchor”, were rated as not plausible, but sometimes others were as well. This may explain why the real loudspeaker was also rated as implausible in isolated cases.

4 Discussion

4.1 Close to real binaural auralisations

Overall, the subjective ratings given for the binaural auralisations, both measured and simulated, were in good agreement with those for the real loudspeaker. The deviations between the loudspeaker and the compared auralisations were much smaller than the introduced manipulations in reverberance and source distance. The auralisations were largely judged to be plausible when rated on a binary scale, which had also been shown previously with a yes-no paradigm after Lindau and Weinzierl [13] for these auralisations [19]. In addition, externalisation was achieved for most conditions. One noteworthy exception was that the BRIRs measured with the HATS were judged to be brighter than the real loudspeaker. There was also a tendency for the simulations to be perceived as darker, which is discussed in detail in Section 4.3.

4.2 Need for individual HRIRs?

Recent discussions have suggested that in certain scenarios there is no need for individual HRTFs, e.g., when using speech stimuli in auralised classroom scenarios [20]. This finding may be related to the use of head-tracking and of speech signals, which were found to be less critical compared to, e.g., noise [12]. Especially in the field of room acoustics, speech plays an important role. In the present study, participants rated the reproduction quality, plausibility, and externalisation for most of the auralisations as comparable to the real loudspeaker in the room. A difference can be seen for the tone colour attribute, where participants rated BRIRs based on HATS data as brighter than individual BRIRs. This colouration effect is, however, reduced by individualisation, as seen for “sim:HATS” and “sim:Indiv”, but also by measuring the BRIRs with a human. A possible explanation for the higher brightness ratings for the artificial head could be the lack of hair, the high skin impedance, and the absence of clothing on the HATS.

There seems to be a decrease in plausibility and reproduction quality when using the generic humanised data set. When measuring BRIRs with a human, microphone placement and movement during the measurement are likely to influence the reproduction of binaural signals. An overview of such sources of error was given by Brinkmann et al. [12]. It is also possible that the interaural differences deviate more from the participants' own than those of the HATS, because only one specific head diameter of a male human was used, which could have been too large for some of the smaller participants. For the HATS used, the head diameter is relatively small compared to that of a typical human [61, 62]. Some participants reported that they perceived a localisation error relative to the visualised loudspeaker for the “meas:Human” BRIRs. This could also indicate errors due to measurement inaccuracies, or interaural differences compared to the individual data.

Note that measured generic BRIRs and simulated BRIRs based on measured generic HRTFs can be used for auralised classroom scenarios, apart from the tone colour difference. Even though the tone colour was perceived as brighter, the generic HATS data were more plausible and rated higher in reproduction quality than the generic humanised BRIRs. Most of the time, the auralisations were judged to be plausible on the binary rating scale. This finding is consistent with an earlier investigation [19] using similar conditions and the blind yes-no paradigm proposed by Lindau and Weinzierl [13]. In addition, no front-back confusion was reported during the interviews. Overall, the generic data worked well for the scenario using a speech stimulus, real room acoustics, and head-tracking. As measured generic BRIRs differ mainly in tone colour, they could be considered a valid reference if the tone-colour difference were resolved.

4.3 Measurement versus simulation

Auralisations based on both simulations and measurements were in good agreement with the real loudspeaker in terms of source distance, loudness, and reproduction quality. However, there was a difference in tone colour when comparing the real loudspeaker to “meas:HATS”. Moreover, for both methods a difference existed depending on whether the BRIRs were based on humanised or generic datasets, with the generic being rated brighter than the humanised. This difference was comparably large for both auralisation methods, with a tendency to be slightly smaller for the simulations. However, the ratings tended to shift towards “darker” for the simulations: “sim:HATS” was rated closer to the actual loudspeaker than “sim:Indiv”. As the real loudspeaker is naturally perceived with the listener's own ears, “sim:Indiv” should theoretically be closer to the real loudspeaker, and “sim:HATS” should be rated brighter. Consequently, it may be assumed that, although there was generally good correspondence to the real loudspeaker in most ratings, the simulation is slightly too dark in tone colour.

4.4 Influence of HMD headphone combination

It is known that HWDs are likely to influence both colouration [25] and localisation [24, 63]. In more complex scenes, however, no effect was found when psychoacoustic measures were evaluated [64]. Therefore, it remains unclear whether the influence of HWDs can be perceived in a real-life scenario. In this study, no difference was found in the ratings of room acoustic attributes with or without the HMD headphone combination. It should be noted that this was true for a loudspeaker at 0° using a speech stimulus in a virtual classroom scenario. It should also be noted that this study accounted for the influence of the HMD headphone combination by measuring both BRIRs and HRTFs while wearing it. The comparison with BRIRs obtained without the HMD headphone combination was carried out to check for differences in the current experiments. This was important because the auralisations were directly compared to a real loudspeaker in the room; without such a comparison, the influence could be neglected. When evaluating room acoustics prior to construction in a VR scenario, where a real loudspeaker does not need to be considered, the influence of the HMD headphone combination can therefore be neglected.

4.5 What must be done before using auralisations prior to construction?

As this study focuses on using binaural auralisations to assess room acoustics prior to construction, it is important to identify what is still missing to achieve this goal. A significant concern is the room-divergence effect, which occurs when room acoustics are perceived in an environment that differs from the actual room to be evaluated [21]. It is necessary to examine whether visualisation through an HMD and a room model of the auralised room could help to reduce the room-divergence effect. More specifically, the observer's knowledge of the actual (diverging) room in which the rendering is provided may influence how the virtually rendered audiovisual room is perceived. So far, the experiments have been conducted in the actual room while wearing an HMD that displays a 3D model of the same room, avoiding any room divergence. It will be interesting to investigate whether this approach remains effective when the same auralisations are assessed with this setup in a different room. Additionally, it must be noted that the room-acoustic simulation was adjusted to a measured monaural RIR. This adjustment is not feasible if the room being investigated has not yet been built. The success of auralisations based on a room-acoustic simulation therefore depends significantly on the accuracy of the estimated reverberation time and the location-dependent absorption coefficients.

5 Conclusion and outlook

These results permit the conclusion that close-to-real binaural auralisations can be achieved with both measured and simulated BRIRs. Head-tracked binaural auralisations are well suited for adding real-world audio to VR. When absorption coefficients are accurately tuned, the simulation ratings closely match those of a real loudspeaker in a real room. Although there are differences in tone colour, BRIRs based on generic HATS data can be used to achieve an externalised and plausible auralisation when rated on a binary scale, which is less time-consuming than using an individual data set. Except for tone colour, there is no benefit in personalising the data. In addition, the HMD headphone combination does not affect the ratings for a measured BRIR set.

Further research is required to validate the use of visual room models presented via an HMD to substitute the visual perception in a real room. It needs to be demonstrated that room acoustic ratings with IVEs are not influenced by the physical room in which they are presented.

Acknowledgments

We thank our subjects for participating in the study. English language services were provided by stels-ol (desmosa@gmx.de).

Funding

This work was funded by the German Research Foundation (Deutsche Forschungsgemeinschaft, DFG) under the project ID 422686707, SPP2236 – AUDICTIVE, subproject 444832396.

Conflicts of interest

The authors declare no conflict of interest.

Data availability statement

The raw data of the rating results and the transcripts of the interview associated with this article are available in Zenodo [65].

References

  1. F. Stärz, L.O.H. Kroczek, S. Roßkopf, A. Mühlberger, S. Van De Par, M. Blau: Comparing room acoustical ratings in an interactive virtual environment to those in the real room, in Proceedings of the Forum Acusticum 2023, Turin, Italy, 11–15 September, European Acoustics Association, 2023, pp. 5010–5016 [Google Scholar]
  2. M. Kleiner, B.I. Dalenbäck, P. Svensson: Auralization – an overview. Journal of Audio Engineering Society 41 (1993) 861–875 [Google Scholar]
  3. S. Roßkopf, L.O.H. Kroczek, F. Stärz, M. Blau, S. Van de Par, et al. 2024. The impact of binaural auralizations on sound source localization and social presence in audiovisual virtual reality: converging evidence from placement and eye-tracking paradigms. Acta Acustica, 8, 72. https://doi.org/10.1051/aacus/2024064 [CrossRef] [EDP Sciences] [Google Scholar]
  4. T. Lokki, V. Pulkki: Evaluation of the geometry-based parametric auralization, in Proceedings of the International Conference on Virtual, Synthetic and Entertainment Audio, Espoo, Finland, 15–17 June, Audio Engineering Society, pp. 367–376 [Google Scholar]
  5. J. Rindel, C. Christensen: Room acoustic simulation and auralization – how close can we get to the real room? keynote lecture, in Proceedings of the Eighth Western Pacific Acoustics Conference, Melbourne, Australia, 7–9 April, 2003 [Google Scholar]
  6. B.N.J. Postma, B.F.G. Katz: Perceptive and objective evaluation of calibrated room acoustic simulation auralizations. Journal of the Acoustical Society of America 140 (2016) 4326–4337 [Google Scholar]
  7. F.C. Tommasini, O.A. Ramos, M.X. Hüg, S.P. Ferreyra: A computational model to implement binaural synthesis in a hard real-time auditory virtual environment. Acoustics Australia 47 (2019) 51–66 [CrossRef] [Google Scholar]
  8. A. Lindau, V. Erbes, S. Lepa, H.-J. Maempel, F. Brinkman, S. Weinzierl: A spatial audio quality inventory (SAQI). Acta Acustica united with Acustica 100 (2014) 984–994 [CrossRef] [Google Scholar]
  9. L. Aspöck, Validation of room acoustic simulation models. PhD Thesis, RWTH Aachen University, 2020 [Google Scholar]
  10. F. Brinkmann, L. Aspöck, D. Ackermann, S. Lepa, M. Vorländer, S. Weinzierl: A round robin on room acoustical simulation and auralization. Journal of the Acoustical Society of America 145 (2019) 2746–2760 [CrossRef] [PubMed] [Google Scholar]
  11. E. Hendrickx, P. Stitt, J.-C. Messonnier, J.-M. Lyzwa, B.F. Katz, C. de Boishéraud: Influence of head tracking on the externalization of speech stimuli for non-individualized binaural synthesis. Journal of the Acoustical Society of America 141 (2017) 2011–2023 [CrossRef] [PubMed] [Google Scholar]
  12. F. Brinkmann, A. Lindau, S. Weinzierl: On the authenticity of individual dynamic binaural synthesis. Journal of the Acoustical Society of America 142 (2017) 1784–1795 [Google Scholar]
  13. A. Lindau, S. Weinzierl: Assessing the plausibility of virtual acoustic environments. Acta Acustica united with Acustica 98 (2012) 804–810 [CrossRef] [Google Scholar]
  14. S.A. Wirler, N. Meyer-Kahlen, S.J. Schlecht: Towards transfer-plausibility for evaluating mixed reality audio in complex scenes, in AES International Conference on Audio for Virtual and Augmented Reality (AVAR), Virtual, 17–19 August, 2020 [Google Scholar]
  15. J. Oberem, B. Masiero, J. Fels: Experiments on authenticity and plausibility of binaural reproduction via headphones employing different recording methods. Applied Acoustics 114 (2016) 71–78 [CrossRef] [Google Scholar]
  16. T. Lübeck, C. Pörschmann: Evaluating the plausibility of non-individual head-related transfer functions in anechoic conditions, in Proceedings of the Forum Acusticum 2023, Turin, Italy, 11–15 September, European Acoustics Association, 2023, pp. 6205–6211 [Google Scholar]
  17. C. Pike, F. Melchior, T. Tew: Assessing the plausibility of non-individualised dynamic binaural synthesis in a small room, in 55th International AES Conference, Helsinki, Finland, 27–29 August, 2014 [Google Scholar]
  18. A. Neidhardt, C. Schneiderwind, F. Klein: Perceptual matching of room acoustics for auditory augmented reality in small rooms – literature review and theoretical framework. Trends in Hearing 26, 23312165221092919 [Google Scholar]
  19. F. Stärz, L.O.H. Kroczek, S. Roßkopf, A. Mühlberger, S. Van De Par, M. Blau: Perceptual comparison between the real and the auralized room when being presented with congruent visual stimuli via a head-mounted display, in Proceedings of the 24th International Congress on Acoustics (ICA), Gyeongju, Korea, 24–28 October, 2022 [Google Scholar]
  20. M. Blau, A. Budnik, M. Fallahi, H. Steffens, S.D. Ewert, S. Van De Par: Toward realistic binaural auralizations – perceptual comparison between measurement and simulation-based auralizations and the real room for a classroom scenario. Acta Acustica 5 (2021) 8 [CrossRef] [EDP Sciences] [Google Scholar]
  21. K. Brandenburg, S. Werner, F. Klein, C. Sladeczek: Auditory illusion through headphones: history, challenges and new solutions. Proceedings of Meetings on Acoustics 28 (2016) 050010 [CrossRef] [Google Scholar]
  22. M. Slater, S. Wilbur: A framework for immersive virtual environments (FIVE): speculations on the role of presence in virtual environments. Presence: Teleoperators and Virtual Environments 6 (1997) 603–616 [CrossRef] [Google Scholar]
  23. S. Roßkopf, L.O.H. Kroczek, F. Stärz, M. Blau, S. Van De Par, A. Mühlberger: The effect of audio-visual room divergence on the localization of real sound sources in virtual reality, in Fortschritte der Akustik (DAGA), Hamburg, 6–9 March, 2023 [Google Scholar]
  24. D. Satongar, C. Pike, Y.W. Lam, A.I. Tew: The influence of headphones on the localization of external loudspeaker sources. Journal of the Audio Engineering Society 63 (2015) 799–810 [CrossRef] [Google Scholar]
  25. C. Schneiderwind, A. Neidhardt, D. Meyer: Comparing the effect of different open headphone models on the perception of a real sound source, in 150th Convention of the Audio Engineering Society, Online, 25–28 May, 2021 [Google Scholar]
  26. P. Lladó, T. McKenzie, N. Meyer-Kahlen, S. Schlecht: Predicting perceptual transparency of head-worn devices. Journal of the Audio Engineering Society 70 (2022) 585–600 [CrossRef] [Google Scholar]
  27. F. Stärz, S. Roßkopf, A. Mühlberger, L. Kroczek, S. Par, M. Blau: Acoustically transparent headphones as an add-on for a head-mounted display, in Fortschritte der Akustik (DAGA), Hannover, 18–21 March, 2024 [Google Scholar]
  28. F. Denk, F. Brinkmann, A. Stirnemann, B. Kollmeier: The PIRATE: an anthropometric earPlug with exchangeable microphones for individual reliable acquisition of transfer functions at the ear canal entrance, in Fortschritte der Akustik (DAGA), Rostock, Germany, 18–21 March, 2019 [Google Scholar]
  29. T. Wendt, S. Van de Par, S.D. Ewert: A computationally-efficient and perceptually-plausible algorithm for binaural room impulse response simulation. Journal of the Audio Engineering Society 62 (2014) 748–766 [CrossRef] [Google Scholar]
  30. H. Steffens, S. Van de Par, S. Ewert: Perceptual relevance of speaker directivity modelling in virtual rooms, in Proceedings of the 23rd International Congress on Acoustics, Aachen, Germany, 9–13 September, Deutsche Gesellschaft für Akustik, 2019, pp. 2651–2658 [Google Scholar]
  31. J.B. Allen, D.A. Berkley: Image method for efficiently simulating small-room acoustics. Journal of the Acoustical Society of America 65 (1979) 943–950 [Google Scholar]
  32. J.-M. Jot, A. Chaigne: Digital delay networks for designing artificial reverberators, in 90th AES Convention, Paris, France, February 1991 [Google Scholar]
  33. P. Majdak, P. Balazs, B. Laback: Multiple exponential sweep method for fast measurement of head-related transfer functions. Journal of the Audio Engineering Society 55 (2007) 623–637 [Google Scholar]
  34. A. Novák, L. Simon, F. Kadlec, P. Lotton: Nonlinear system identification using exponential swept-sine signal. IEEE Transactions on Instrumentation and Measurement 59 (2010) 2220–2229 [CrossRef] [Google Scholar]
  35. F. Brinkmann, S. Weinzierl: AKtools – an open software toolbox for signal acquisition, processing, and inspection in acoustics, in AES 142nd Convention, Berlin, Germany, 20–23 May (Engineering Brief 309, 2017 May). Available at http://www.aes.org/e-lib/browse.cfm?elib=18685 [Google Scholar]
  36. V.R. Algazi, C. Avendano, R.O. Duda: Estimation of a spherical-head model from anthropometry. Journal of the Audio Engineering Society 49 (2001) 472–479 [Google Scholar]
  37. P. Majdak, Y. Iwaya, T. Carpentier, R. Nicol, M. Parmentier, A. Roginska, Y. Suzuki, K. Watanabe, H. Wierstorf, H. Ziegelwanger, M. Noisternig: Spatially oriented format for acoustics: a data exchange format representing head-related transfer functions, in AES 134th Convention, Rome, Italy, 4–7 May, 2013 [Google Scholar]
  38. E. Sengpiel: Gleichungen für die Pegeldifferenz- und Laufzeitdifferenz-Lokalisationskurve [Online]. Available at https://sengpielaudio.com/GleichungenDLundDt.pdf [Google Scholar]
  39. F. Klein, S. Werner, T. Mayenfels: Influences of training on externalization of binaural synthesis in situations of room divergence. Journal of the Audio Engineering Society 65 (2017) 178–187 [Google Scholar]
  40. O. Kirkeby, P. Nelson: Digital filter design for inversion problems in sound reproduction. Journal of the Audio Engineering Society 47 (1999) 583–595 [Google Scholar]
  41. M. Paquier, V. Koehl: Audibility of headphone positioning variability, in 128th Audio Engineering Society Convention, Londres, United Kingdom, 22–25 May, 2010 [Google Scholar]
  42. H. Jäger, J. Bitzer, U. Simmer, M. Blau: Echtzeitfähiges binaurales Rendering mit Bewegungssensoren von 3D-Brillen, in Fortschritte der Akustik (DAGA), Kiel, Germany, 6–9 March, 2017 [Google Scholar]
  43. H. Jäger, U. Simmer, J. Bitzer, M. Blau: Time-variant overlap-add in partitions [Online]. Available at http://arxiv.org/abs/2310.00319 [Google Scholar]
  44. N. Meyer-Kahlen, M. Kastemaa, S. J. Schlecht, T. Lokki: Measuring motion-to-sound latency in virtual acoustic rendering systems. Journal of the Audio Engineering Society 71 (2023) 390–398 [CrossRef] [Google Scholar]
  45. D.S. Brungart, A.J. Kordik, B.D. Simpson: Effects of headtracker latency in virtual audio displays. Journal of the Audio Engineering Society 54 (2006) 32–44 [Google Scholar]
  46. S. Weinzierl, S. Lepa, D. Ackermann: A measuring instrument for the auditory perception of rooms: the room acoustical quality inventory (RAQI). Journal of the Acoustical Society of America 144 (2018) 1245–1257 [CrossRef] [PubMed] [Google Scholar]
  47. D. Leckschat, C. Epe: Aufnahmen von Sprecherinnen und Sprechern zur Verwendung in der Virtuellen Akustik. Zenodo, 2020. https://doi.org/10.5281/zenodo.3601086 [Google Scholar]
  48. EBU: EBU r 128 loudness normalisation and permitted maximum level of audio signals [Online]. Available at https://tech.ebu.ch/docs/r/r128_2011_DE.pdf [Google Scholar]
  49. MathWorks: Audio Toolbox User's Guide. The MathWorks, Inc., 2024. Available at https://de.mathworks.com/help/pdf_doc/audio/audio_ug.pdf [Google Scholar]
  50. P. Chevret, E. Parizet: An efficient alternative to the paired comparison method for the subjective evaluation of a large set of sounds, in 19th International Congress on Acoustics, Madrid, Spain, 2–7 September, 2007 [Google Scholar]
  51. R Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2024. Available at http://www.R-project.org/ [Google Scholar]
  52. W. Revelle: psych: Procedures for psychological, psychometric, and personality research, version 2.4.3. Comprehensive R Archive Network. 2024, https://CRAN.R-project.org/package=psych [Google Scholar]
  53. M.A. Lawrence: ez: Easy analysis and visualization of factorial experiments, version 4.4-0. Comprehensive R Archive Network. 2016, http://github.com/mike-lawrence/ez [Google Scholar]
  54. S.S. Shapiro, M.B. Wilk: An analysis of variance test for normality (complete samples). Biometrika 52 (1965) 591–611. https://doi.org/10.2307/2333709 [CrossRef] [Google Scholar]
  55. J.P. Royston: An extension of Shapiro and Wilk's W test for normality to large samples. Journal of the Royal Statistical Society: Series C (Applied Statistics) 31 (1982) 115–124 [Google Scholar]
  56. S. Holm: A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6 (1979) 65–70 [Google Scholar]
  57. R.V. Hogg, J. Ledolter: Engineering statistics. Macmillan, 1987 [Google Scholar]
  58. M. Hollander, D.A. Wolfe, E. Chicken: Nonparametric statistical methods. John Wiley & Sons, 2015 [CrossRef] [Google Scholar]
  59. J.D. Gibbons, S. Chakraborti: Nonparametric statistical inference, in M. Lovric (ed.) International Encyclopedia of Statistical Science. Springer Berlin Heidelberg, Berlin, Heidelberg, 2011, pp. 977–979 [CrossRef] [Google Scholar]
  60. N. Meyer-Kahlen: Transfer-plausible acoustics for augmented reality. Doctoral thesis, Aalto University, 2024 [Google Scholar]
  61. K. Genuit: Ein modell zur beschreibung von aussenohrübertragungseigenschaften (A model for describing transfer-functions of the outer ear). Dissertation, RWTH Aachen, 1984 [Google Scholar]
  62. K. Genuit, A. Fiebig: Do we need new artificial heads? in 19th International Congress on Acoustics 2007 (ICA 2007), Madrid, Spain, 2–7 September, 2007 [Google Scholar]
  63. A. Ahrens, K.D. Lund, M. Marschall, T. Dau: Sound source localization with varying amount of visual information in virtual reality. PLoS One 14 (2019) e0214603 [CrossRef] [PubMed] [Google Scholar]
  64. S. Fichna, T. Biberger, B.U. Seeber, S.D. Ewert: Effect of acoustic scene complexity and visual scene representation on auditory perception in virtual audio-visual environments, in Immersive and 3D Audio: from Architecture to Automotive (I3DA), Bologna, Italy, 8–10 September, IEEE, 2021, pp. 1–9 [Google Scholar]
  65. F. Stärz, S. Van de Par, S. Roßkopf, L. Kroczek, A. Mühlberger, M. Blau: Supplementary material – Blind comparison of binaural auralisations to a real loudspeaker in an audiovisual virtual classroom scenario: Effect of room acoustic simulation, HRTF dataset and head worn devices on rated room-acoustical attributes. 2024, https://zenodo.org/records/12543044, DOI: 10.5281/zenodo.12543044 [Google Scholar]

Cite this article as: Stärz F., Van De Par S., Roßkopf S., Kroczek L.O.H., Mühlberger A. & Blau M. 2025. Comparison of binaural auralisations to a real loudspeaker in an audiovisual virtual classroom scenario: Effect of room acoustic simulation, HRTF dataset, and head-mounted display on room acoustic perception. Acta Acustica 9, 31. https://doi.org/10.1051/aacus/2025012.

All Tables

Table 1

Overview of the BRIR sets used.

All Figures

Figure 1

Room under investigation. Top: a HATS at the listening position within the room. Bottom: the virtual model of the same room, as displayed through the HMD during the listening test.

Figure 2

Graphical user interface with which participants interacted in VR, using sliders to rate the different auralisations for each attribute. In this case, the attribute being rated is reproduction quality. Wiedergabequalität = reproduction quality; Hoch = high; Gering = low; Ordnen = order.

Figure 3

A block diagram providing an overview of the steps involved in realising the head-tracked binaural auralisations. Open headphones were used to allow comparison with a real loudspeaker in the real room.

Figure 4

Objective data derived from all BRIRs used in this study. Left: reverberation time (T20) expected to be the same for all BRIR sets except for sim:Wet; Right: A-weighted third-octave band sound pressure levels after convolving the speech stimulus used in the listening test with the BRIRs.

Figure 5

Box plots showing the ratings, averaged over the repeated measurements, as a function of the BRIRs for each rated room-acoustic attribute. Statistically significant differences (repeated-measures ANOVA; for source distance, a non-parametric Wilcoxon signed-rank test; both Bonferroni-Holm corrected) are shown starting from the dot and refer to each downward tick on the respective line. The corresponding significance asterisk is placed below the line near the downward tick.

Figure 6

Box plots showing ratings averaged over the two repeated measurements as a function of room-acoustic attributes. Generic BRIR data are compared to an individual or human dataset, for both measured and simulated BRIRs, with respect to the real-loudspeaker rating. Significant differences are shown from the dot to each downward tick on the respective line. These data are a subset of Figure 5; the Bonferroni-Holm correction has been applied to the whole data set.

Figure 7

Box plots showing ratings averaged over the repeated measurements as a function of room-acoustic attributes. BRIRs based on measurements with and without the HMD-headphone combination are compared. These data are a subset of Figure 5; the Bonferroni-Holm correction has been applied to the whole data set.

Figure 8

Box plots showing ratings averaged over the repeated measurements as a function of room-acoustic attributes. The introduced manipulations, such as increased reverberation or decreased source distance, are compared to the real measurement and to the simulation without manipulation. Significant differences are shown from the dot to each downward tick on the respective line. These data are a subset of Figure 5; the Bonferroni-Holm correction has been applied to the whole data set.

Figure 9

Bar chart of ratings averaged over the repeated measurements as a function of BRIRs for externalisation and plausibility.

