Acta Acustica, Volume 9, 2025
Topical Issue: Virtual acoustics
Article Number: 6
Number of pages: 18
DOI: https://doi.org/10.1051/aacus/2024068
Published online: 21 January 2025
Scientific Article
Comparison of speech intelligibility in a real and virtual living room using loudspeaker and headphone presentations
* Corresponding author: julia.schuetze@uol.de
Received: 17 May 2024
Accepted: 1 October 2024
Virtual acoustics enables hearing research and audiology in ecologically relevant and realistic acoustic environments, while offering the experimental control and reproducibility of classical psychoacoustics and speech intelligibility tests. Indoor environments are particularly relevant here: listening and speech communication frequently involve multiple targets and interferers, as well as connected adjacent spaces that may create challenging acoustics. Hence, a controllable laboratory environment that closely resembles a typical German living room with an adjacent kitchen is evaluated here by room acoustical parameters and speech intelligibility. Target and interferer positions were permuted over four different locations, including an acoustically challenging position of a target in the kitchen with interrupted line of sight. Speech intelligibility was compared in the real room, in virtual acoustic representations, and in standard anechoic audiological configurations. Three presentation modes were tested: headphones, loudspeaker rendering on a small-scale, four-channel loudspeaker array in a sound-attenuated listening booth, and a three-dimensional 86-channel loudspeaker array in an anechoic chamber. The results showed that the target talker in the coupled room requires higher signal-to-noise ratios (SNRs) at threshold than typical indoor conditions. Moreover, for the stationary speech-shaped interferer, effects of room acoustics were negligible. For a majority of target positions, no differences between the four-channel and the large-scale loudspeaker arrays were found, with an overall good agreement with the real room. This indicates that ecologically valid testing is feasible using a clinically applicable small-scale loudspeaker array.
Key words: Virtual acoustics / Ecological validity / Speech audiometry
© The Author(s), Published by EDP Sciences, 2025
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1 Introduction
Complex acoustic environments are highly relevant for hearing research because they provide a realistic and ecologically valid context for understanding how people perceive and interact with sound in their daily lives [1]. In contrast, traditional psychoacoustic research has often relied on simplified, artificial stimuli that do not represent the complexity of real-world acoustic environments, but instead offer highly controlled, reproducible acoustic conditions (e.g., tone-in-noise masking [2]; amplitude modulation detection [3]; minimum audible angle [4]). Realistic acoustic environments often consist of numerous non-stationary sound sources at varying locations. Additionally, in real-world environments, reverberation and echoes affect how sounds are perceived [1, 5]. While real-world listening tests in the field offer a high degree of ecological validity, they require considerable organizational effort, and the acoustic conditions during the test are challenging to document and control. Instead, reproduction of complex life-like acoustic environments in a laboratory setting has been proposed as a step towards more ecologically valid testing in hearing research [1]. Here, ecological validity refers to “the degree to which research findings reflect real-life hearing-related function, activity, or participation” [1]. Recent technological developments allow for listening experiments using virtual acoustics, where listeners are presented with different scenes using headphones or loudspeaker arrays, e.g. [6–8]. Virtual acoustics, however, relies on simulations that use simplified models of the real world. A comparison between the real world and simulated acoustic environments is a crucial step towards validation of the simulated environment before implementation in the laboratory and broader application, e.g. [9]. It is still unknown to which degree testing in virtual acoustics reflects hearing functions in everyday life (see also [10]).
It is therefore important to conduct listening tests in the real world and in its virtual representation which are directly comparable in order to evaluate and verify the usefulness of virtual acoustics in hearing research.
Regarding acoustic conditions, two factors differentiate realistic conditions from “classical” psychoacoustics and audiometric tests: spatially distributed interfering sound sources contribute to scene complexity [11], and in enclosed spaces, sound is affected by reverberation. In everyday situations, a talker of interest is likely spatially separated from one or more interfering noise sources, introducing interaural phase and level differences of target and noise, and enabling binaural unmasking and better-ear listening [12–16]. In standard audiological tests, the effect of binaural hearing is usually tested with speech audiometry in conditions with a frontal speech target and a co-located (S0N0) or spatially separated interferer at an angle of 90° (S0N90; [17]). Most speech-in-noise intelligibility tests utilize some form of broadband interferer (e.g., stationary speech-shaped noise, SSN) or babble noise, as they are controllable and adjustable. However, these advantages come at the cost of relative predictability and a lack of realism, which makes results less transferable to real life [18]. When interferers with temporal gaps are utilized in speech-in-noise tests, dip listening [14, 19] and reverberation [20] can further affect speech intelligibility.
In reverberant environments, the direct sound is superimposed with discrete early reflections from different directions and diffuse late reverberation. A positive influence of early reflections on speech intelligibility has been observed [21], whereas late reverberation has a detrimental effect by reducing fluctuations of the speech envelope [14, 22]. The Speech Transmission Index (STI, [23]) is a common measure to quantify detrimental effects of reverberation and noise on the speech envelope and thereby on speech intelligibility. As a consequence of reverberation, dip listening in the troughs of temporally fluctuating interferers might be hampered. To a certain degree, normal-hearing listeners can compensate for the impact of reverberation on speech signals, so that speech intelligibility remains unaffected in environments with a “moderate” amount of reverberation [24]. Brandewie and Zahorik [24] showed that, after listeners were exposed to the reverberant environment, speech reception thresholds improved, which indicates some level of compensation for the reverberation. They used the term “moderate” to describe a room that approximates the acoustic properties of a large office, with a broadband reverberation time (T60, [25]) of 0.42 s and a broadband clarity index (C50, [26]) of 13.4 dB, representing the ratio of the energy within the first 50 ms (dominated by early reflections) to the energy in the late part of the impulse response.
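As an illustration of how such decay-based room acoustic parameters are obtained from an impulse response, the following sketch estimates T60 (via a T30 fit) and C50 using backward (Schroeder) integration. The synthetic exponentially decaying noise stand-in and the function names are illustrative assumptions, not the analysis code used in any of the cited studies.

```python
import numpy as np

def schroeder_edc_db(rir):
    """Energy decay curve via backward (Schroeder) integration, in dB."""
    energy = np.cumsum(rir[::-1] ** 2)[::-1]
    return 10.0 * np.log10(energy / energy[0])

def t60_from_t30(rir, fs):
    """T60 estimated from the -5 dB to -35 dB decay range (T30 fit)."""
    edc = schroeder_edc_db(rir)
    t = np.arange(len(rir)) / fs
    fit = (edc <= -5.0) & (edc >= -35.0)
    slope = np.polyfit(t[fit], edc[fit], 1)[0]  # decay rate in dB per second
    return -60.0 / slope

def c50_db(rir, fs):
    """Clarity index C50: energy in the first 50 ms over the rest, in dB."""
    split = int(0.05 * fs)
    return 10.0 * np.log10(np.sum(rir[:split] ** 2) / np.sum(rir[split:] ** 2))

# Synthetic stand-in RIR: 1 s of exponentially decaying noise with T60 = 0.4 s
# (amplitude falls by 60 dB over 0.4 s, i.e. a factor 10**(-7.5) per second).
fs = 44100
t = np.arange(fs) / fs
rng = np.random.default_rng(0)
rir = rng.standard_normal(fs) * 10.0 ** (-7.5 * t)
```

For this synthetic example, `t60_from_t30(rir, fs)` recovers a value close to 0.4 s, and `c50_db(rir, fs)` is positive, i.e. most of the energy arrives early, as expected for moderate reverberation.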
Recent technological progress enables the rendering of static and interactive sounds in various conditions using room acoustic simulations with reproduction via headphones or multiple loudspeakers, forming virtual acoustic environments. Their use in hearing research [27–35] may enable more ecologically valid testing compared to traditional paradigms, providing “controlled realism” with lab-based reproduction of prototypical or existing real-world environments. Speech intelligibility has been measured in several virtual acoustic environments: in an office meeting scenario reproduced using higher-order Ambisonics [36] recordings, rendered on a 64-channel spherical loudspeaker array [37], and in a simulated reverberant cafeteria in a 3-D loudspeaker array with 41 equalized loudspeakers [29]. The influence of reverberation on speech detection and localization in a multi-talker environment has been investigated by Buchholz and Best [38]. Direct comparisons between real rooms and their virtual reproductions have also been performed: Cubick et al. [30] compared speech intelligibility in a real room and two different virtual reproductions. The virtual reproductions were auralized via a spherical array of 29 loudspeakers using Ambisonics or the nearest-loudspeaker method and were validated against the real environment in terms of room acoustic parameters and speech intelligibility with normal-hearing listeners. Reverberation time (T30) and speech clarity (C50) were well preserved in the virtual sound environment; however, slight deviations, especially at lower frequencies, were observed. Speech recognition thresholds (SRTs) were 2–4 dB higher in the virtual sound environment than in the real room, indicating that the virtual environment was more challenging for speech intelligibility. Kondo et al. [39] compared speech intelligibility in a real room and virtual reproductions with individualized head-related transfer functions (HRTFs) and with HRTFs of a KEMAR dummy head. They reported that speech intelligibility was best in the real space. Discrepancies were most pronounced for spatial configurations with a frontal target and a competing babble noise from behind, where the virtual acoustic environment showed a 10% larger degradation of the chance-adjusted correct responses. Potential reasons for the discrepancy are limitations of the HRTFs and simplifications in the virtual environment, as well as listener adaptation. Ahrens et al. [40] compared speech intelligibility between a reverberant standard listening room, used as the reference, and measured and simulated reproductions with a spherical 64-channel loudspeaker array. The simulation was created with the commercially available acoustic simulation software ODEON; one rendering used nearest-loudspeaker mapping and a second used a mixed-order Ambisonics coding strategy. In terms of SRTs, the reproduction based on measured room impulse responses (RIRs) matched the reference room best, while for spatially separated target and interferer, the two simulated reproductions showed significantly lower SRTs than the reference room. For co-located target and interferer, no differences were found between the two simulations and the reference room. Ahrens et al. attributed the observed discrepancies to errors in the early reflections of their simulations.
Other studies compared measured binaural room impulse responses (BRIRs) of real environments with simulations: Rychtáriková et al. [41] compared speech intelligibility with measured and simulated BRIRs of a large reverberant room (9.3–7.1 s reverberation time for low frequencies and 2.7–1.3 s for octave bands up to 8 kHz) with no parallel walls and three different spatial sound scenarios (S0N0, S0N90, and S0N180). Speech intelligibility was found to be better in the anechoic condition than in the reverberant condition, in which the differences between the three spatial sound scenarios were small. An overall good agreement was found between the results with simulated and measured BRIRs. Hladek et al. [42] showed a good agreement for speech intelligibility obtained with measured and simulated BRIRs of an underground station. In Schütze et al. [43], a first approach to a direct comparison between standard audiological conditions and recorded BRIRs from a real room was presented for a (stationary) SSN interferer and a limited set of test conditions. Here, slightly increased SRTs were found for the target in the coupled room, while the other target positions matched well with S0N0.
While certain characteristics of acoustic environments, such as interferers and reverberation, can reduce speech intelligibility for normal-hearing listeners, these characteristics have been shown to have a larger impact on listeners with hearing loss [44]. Nevertheless, similar to classical psychoacoustics, hearing loss and individual abilities are commonly assessed using pure-tone audiometry and speech recognition in quiet and in noise [45–50], thus assuming quasi-anechoic sound propagation. Discrepancies between the predicted benefit and the benefit experienced in real life have been reported [51], with audiological tests often poorly predicting speech intelligibility in real-world conditions [52]. Accordingly, the use of daily-life acoustic situations in the laboratory has been proposed for more ecologically valid testing [1].
In an effort to provide an openly available starting point for increased ecological validity in hearing research and audiology, van de Par et al. [53] provided three diverse complex acoustic environments with ground-truth acoustic measurements and 3D models [54–56]. So far, a comparison of recorded and simulated acoustic conditions has only been performed for one of the environments, the underground station [42]. As for speech intelligibility, comparisons of binaural and energetic parameters showed little difference between simulation and measurement. Another environment described in van de Par et al. [53] is a living room, which is highly relevant for acoustic communication in daily life [11, 56]. In contrast to larger enclosed spaces, the reverberation time in living rooms with a typical volume between 50 and 60 m3 ranges from 0.63 s at 125 Hz to 0.38 s at 4000 Hz [57]. From such low reverberation times, only minor effects on speech intelligibility are expected, at least for normal-hearing listeners. However, a critical condition, particularly for hearing-impaired listeners, occurs for communication across connected rooms with an interrupted line of sight to the sound source. Schulte et al. [58] evaluated the occurrence and significance of 22 different everyday acoustic situations for hearing-impaired listeners through a questionnaire. These acoustic situations ranged from following the news on TV and being a passenger in a car to understanding public announcements in a loud and reverberant train station. Notably, being addressed from an adjacent room ranked second highest in importance, preceded only by watching the news on television.
When comparing (static) binaural presentations via headphones to loudspeaker reproductions as used in the literature, different advantages and disadvantages arise: Loudspeaker reproduction allows for natural head movements and enables the listener to wear hearing aids during the measurement. While binaural headphone presentations are unaffected by the acoustics of the listening room, loudspeaker reproductions typically require an acoustically treated or ideally anechoic listening room. Another practical constraint involves the number and spatial distribution of loudspeakers, with greater numbers generally offering improved reproduction quality [28].
Based on the current body of literature, it remains unclear how speech intelligibility measures compare between a real room with different spatial arrangements of target and interferer and standard audiological test configurations in an anechoic listening booth. Moreover, it is unclear how well virtual acoustics rendered with small-scale loudspeaker arrays, as applicable in a clinical context, can reproduce results obtained in a real room. Towards establishing ecologically valid testing methods in the laboratory, it is important to i) compare speech intelligibility in the real-life environment and in its (virtual) acoustic reproduction, ii) assess the effect of the reproduction system for the virtual environment, e.g., using headphones or loudspeakers, and iii) relate performance in real-life rooms to that obtained in (well-established) standard audiological test conditions.
In the current study, extending Schütze et al. [43], we directly compare speech intelligibility in an ecologically relevant real-life condition to different acoustic reproductions of the measured and simulated room, as well as to speech audiometry test conditions in the same group of normal-hearing listeners. For this, the “living room lab” (or laboratory) was established in the university building (see also [53]) as a real-life environment resembling a typical German living room.
In Experiment 1, SRTs were obtained in a listening test using different loudspeakers as sound sources in the real room, providing the “ground truth” data. SRTs were compared to those obtained with acoustic reproductions using measured or simulated BRIRs. The key questions were whether i) room acoustics can critically affect speech intelligibility in the common home environment of a living room, and whether ii) results obtained in the real living room can be matched with loudspeaker and headphone reproductions. For this, four different source positions were defined in the living room lab: three target positions, including one in the adjacent kitchen room connected by an open door, without line of sight, and one interferer position. A fluctuating nonsense speech interferer was used. Besides binaural reproductions of the recordings, the room acoustics simulation was presented on a small-scale 4-channel horizontal loudspeaker array in a listening booth, as well as on a large-scale spherical 86-channel loudspeaker array in an anechoic chamber (“virtual reality lab”). The loudspeaker array in the listening booth resembles a reproduction system typically available at a commercial hearing aid distributor or in a clinical context. The large-scale system offers better three-dimensional reproduction without undesired acoustic effects of the reproduction room itself. For the binaural reproduction via headphones, the identical simulation was presented in three different listening locations: the real living room, the virtual reality lab, and the listening booth. Here, the goal was to establish whether the surroundings in which listeners performed the headphone experiment, including the real room, affect the results.
In Experiment 2, the key question was how speech intelligibility in the real living room acoustics compares to that obtained in “classical” anechoic conditions with frontal and lateral spatial positions of the interferer. For this, SRTs measured via headphones using the measured BRIRs from the living room lab were compared to standard audiological spatial configurations in simulated free-field conditions. In addition to the fluctuating nonsense speech interferer, which more closely resembles real-world spectro-temporal masking effects, including dip listening, measurements were also conducted using (stationary) SSN.
To assess the room acoustical conditions in the real room more closely and to aid the interpretation of SRTs obtained in the listening tests with recorded and simulated BRIRs, several room acoustical parameters were calculated based on the measured and simulated BRIRs, including early decay time, clarity index C50, and direct-to-reverberant ratio.
2 Methods
2.1 Acoustic environment and conditions
The living room lab was designed and built at the University of Oldenburg to closely resemble a typical German living room with an adjacent smaller (kitchen) room, connected by a door. The living room has the dimensions 4.97 m × 3.78 m × 2.71 m (width × length × height), while the kitchen is 4.97 m × 2.00 m × 2.71 m and the door opening has a width of 0.97 m. The overall volume of both rooms is 77.85 m3. The living room lab is furnished with a sofa and chairs, a coffee table, and a television set on a TV board. Other furniture can be seen in the bottom panel of Figure 1. The flooring is laminate, and a carpet is placed underneath the sofa and coffee table.
Figure 1 Top panel: floor plan of the living room lab at the University of Oldenburg. The coupled (kitchen) room is located on the left, connected to the living room by an open door. The receiver position R is indicated by a head symbol on the couch. The different sound source positions are indicated by loudspeaker symbols. The interferer is color coded red, and the target positions are purple. Dimensions are given in meters. The eye symbol in the bottom right corner indicates the position from which the photo shown in the bottom panel was taken.
For this study, listeners were seated on the couch indicated by position R in Figure 1. The interferer was placed at the fixed position S4. The target talker was presented from one of three positions: directly in front of the listener (STV), to the right of the listener on a chair (S5), and in the coupled kitchen room without a direct line of sight (S7). The target and interferer on the chairs had a height of 1.02 m, while the targets STV and S7 had a height of 1.10 m. In the simulation and the BRIR measurement, the ears of the receiver were located at a height of 1.05 m. For target and interferer, Genelec 8030 CP loudspeakers were located at the respective positions [53, 59]. BRIRs were recorded using a G.R.A.S. KEMAR type 45BM head and torso simulator (see also [59]). The living room lab and the room acoustical measurements are described in more detail in [53] and [59], where the measured room impulse responses are freely available.
The target position STV was located at a distance of 2.51 m from the receiver at an angle of 0°. The interferer position S4 and the target position S5 were in 1.58 m distance from the receiver, at an angle of ±45°. The largest distance of 5.69 m from the receiver occurred for target position S7. S7 was located in the coupled room and only diffracted sound from this target position reached the receiver.
2.2 Room acoustics simulation
Simulated BRIRs and loudspeaker renderings were obtained using the room acoustic simulator RAZR [60] with extensions for coupled rooms [61]. Early reflections were calculated using an image source model of two connected proxy shoebox rooms, and the reverberant tail is based on a feedback delay network. Interior objects, such as furniture, were neglected in the shoebox approximation. The room dimensions and the (measured) reverberation time T30 (based on the decay from −5 dB to −35 dB below the initial level, extrapolated to a decay of 60 dB) in each of the rooms, in octave bands, were used as input parameters for the simulation. Furthermore, a homogeneous distribution of absorption within the respective rooms was assumed. The directivity and spectral characteristics of the Genelec 8030 CP loudspeakers were accounted for using a set of impulse responses, consisting of a total of 1680 directions distributed on a sphere, measured at a distance of 2.85 m in an anechoic chamber. Vector base amplitude panning (VBAP) was employed for rendering to the different loudspeaker arrays. For binaural rendering of the simulated room impulse responses, the RWTH Aachen High-Resolution Head-Related Transfer Function Data Set of KEMAR [62] was utilized. Since the KEMAR head-and-torso simulator used for the BRIR recordings in the current study was equipped with ear canal simulators, whereas the HRTF database [62] was measured without ear canal simulators, a spectral post-processing was applied to the simulated BRIRs: the average spectral difference for the positions STV, S4 and S5 was determined and a set of parametric compensation filters was derived. With these compensation filters applied, the smoothed spectra of the simulated BRIRs matched those of the measured BRIRs within ±3 dB in the frequency range from 0.25 to 12 kHz.
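The VBAP step can be illustrated in two dimensions for a horizontal array: for each source direction, the enclosing loudspeaker pair is found and the pair's gains are solved from the loudspeaker base vectors. The sketch below is a minimal two-dimensional illustration of the VBAP principle only (the helper `vbap_2d_gains` is hypothetical); RAZR's actual rendering is three-dimensional and more involved.

```python
import numpy as np

def vbap_2d_gains(source_az_deg, speaker_az_deg):
    """Pairwise 2-D VBAP: find the adjacent loudspeaker pair enclosing the
    source direction and solve L g = p for non-negative gains.
    speaker_az_deg must be sorted in ascending azimuth."""
    def unit(az_deg):
        a = np.radians(az_deg)
        return np.array([np.cos(a), np.sin(a)])

    p = unit(source_az_deg)
    n = len(speaker_az_deg)
    gains = np.zeros(n)
    for i in range(n):
        j = (i + 1) % n
        L = np.column_stack([unit(speaker_az_deg[i]), unit(speaker_az_deg[j])])
        g = np.linalg.solve(L, p)
        if np.all(g >= -1e-9):                 # source lies between this pair
            g = np.clip(g, 0.0, None)
            g /= np.linalg.norm(g)             # constant-power normalization
            gains[i], gains[j] = g[0], g[1]
            return gains
    raise ValueError("no enclosing loudspeaker pair found")

# A source at 90 deg panned on a 45/135/225/315 deg array (as in the booth)
# is shared equally between the two frontal-lateral loudspeakers.
gains = vbap_2d_gains(90.0, [45, 135, 225, 315])
```

A source exactly on a loudspeaker direction (e.g., 45°) receives a gain of one on that loudspeaker and zero elsewhere, which is the desired limiting behavior of pairwise panning.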
2.3 Stimuli, apparatus, and procedure
Speech intelligibility was measured using the AFC framework [63] with the German matrix sentence test (Oldenburger Satztest, OLSA; [64]) in the presence of an interfering nonsense speech talker, a male-transformed version [65] of the international speech test signal (ISTS; [66]). The ISTS consists of continuous speech composed from six different talkers in six languages (Arabic, Chinese, English, French, German and Spanish). The male-transformed version used here matches the fundamental frequency of the target speech of the matrix sentence test and is referred to as ISTSmale in the following. The interferer had an overall duration of 54 s, from which a randomly selected segment of 5 s was extracted for the presentation of each matrix sentence. While the interferer level was fixed, the level of the target signal was varied adaptively depending on the number of correctly identified words. The matrix sentence test was presented as a closed test, so that listeners were able to see the matrix of possible entries and entered their response by selecting words from the matrix on a touchscreen. The matrix consisted of ten different names, verbs, numbers, adjectives and nouns, which are randomly combined into a target sentence. Reported SRTs reflect the signal-to-noise ratio (SNR) at which 50% of the words were correctly identified.
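The adaptive level rule can be sketched as follows. This toy simulation is not the exact OLSA procedure (which uses a specific word-scoring step rule); it only illustrates how moving the target level against the deviation from 50% words correct converges on the SRT. The simulated listener (a logistic psychometric function) and all parameter values are illustrative assumptions.

```python
import math
import random

def simulate_adaptive_srt(true_srt_db, n_sentences=30, seed=7):
    """Toy adaptive track converging on the SNR yielding 50% words correct.
    A simulated listener answers 5-word sentences; per-word intelligibility
    follows a logistic function around the true SRT (assumed slope)."""
    rng = random.Random(seed)
    snr_db, step_db = 0.0, 3.0
    track = []
    for _ in range(n_sentences):
        p_word = 1.0 / (1.0 + math.exp(-1.5 * (snr_db - true_srt_db)))
        frac_correct = sum(rng.random() < p_word for _ in range(5)) / 5.0
        snr_db -= 2.0 * step_db * (frac_correct - 0.5)  # steer toward 50%
        step_db = max(1.0, 0.85 * step_db)              # shrink step size
        track.append(snr_db)
    return sum(track[-10:]) / 10.0                      # average late SNRs

estimate = simulate_adaptive_srt(-15.0)
```

Running the toy track for a listener with a true SRT of −15 dB yields an estimate within a few dB of that value, and a listener with a higher (worse) true SRT yields a correspondingly higher estimate.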
The target on position STV was calibrated to 65 dB SPL at the receiver position using a stationary noise with the same spectrum as the sentences of the matrix sentence test.
Distance-related level differences between the different (spatial) target positions (STV, S7, S5) were compensated by applying the mean SNR for both ears of the measured BRIRs (see Tab. 1) to the respective SRT results. Thus, the remaining differences in SRT can be solely attributed to spatial differences and other room-related effects, disregarding level differences related to the distance between the listener and the target and interferer.
Table 1. Overview of broadband room acoustic parameters: T30, direct-to-reverberant ratio (DRR), SNR, and speech transmission index (STI). T30 and DRR are shown for the measured and simulated BRIRs for the four source positions in the living room lab; STI is shown for the measured and simulated BRIRs for the three target positions (both channels); broadband SNRs are shown for the three target positions for measured and simulated BRIRs (left and right channels shown individually). SNRs in the bottom two rows (marked with an asterisk) include the level adaptation.
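The level compensation can be illustrated by how a broadband SNR is derived from a target and an interferer BRIR: with equal source levels, the per-ear SNR is the ratio of the BRIR energies, averaged over both ears in dB. The function below is an illustrative sketch under that assumption, not the study's actual analysis code.

```python
import numpy as np

def broadband_snr_db(target_brir, interferer_brir):
    """Broadband SNR implied by two BRIRs (arrays of shape samples x 2 ears),
    assuming equal source levels: per-ear energy ratio, averaged in dB."""
    e_t = np.sum(target_brir ** 2, axis=0)      # per-ear target energy
    e_i = np.sum(interferer_brir ** 2, axis=0)  # per-ear interferer energy
    return float(np.mean(10.0 * np.log10(e_t / e_i)))

# Toy BRIRs: the target pulse arrives at half the interferer's amplitude,
# i.e. a quarter of the energy, giving a broadband SNR of about -6 dB.
target = np.zeros((128, 2)); target[0, :] = 0.5
interferer = np.zeros((128, 2)); interferer[0, :] = 1.0
snr = broadband_snr_db(target, interferer)
```

Compensating an SRT then amounts to offsetting it by this broadband SNR, so that the distance-related level difference between target positions is removed from the comparison.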
In Experiment 1, speech intelligibility with a loudspeaker presentation in the real living room was compared to different acoustic reproductions of said environment, resulting in seven measurement conditions (see Tab. 2 for an overview): a loudspeaker presentation in the real living room, a horizontal 4-channel loudspeaker array in a listening booth, a three-dimensional 86-channel loudspeaker array in the virtual reality (VR) lab, measured reproductions presented with headphones in the living room lab, and simulated reproductions presented with headphones in the living room lab, the listening booth, and the virtual reality lab. To assess the influence of the visual environments, measurements using headphone reproductions took place in all locations.
Table 2. Overview of the different modes of presentation in both experiments.
The “ground truth” set-up in the real living room lab consisted of four loudspeakers (of which two were active in each measurement) at the positions STV, S7, S5 and S4 (LRL-LS), which are shown in Figure 1. Seated at position R (see Fig. 1), the listeners were additionally tested with the simulation via binaural rendering on headphones (LRL-HP). The second set-up was a small-scale, horizontal 4-channel loudspeaker array (45°, 135°, 225°, 315°) placed on a radius of 1.0 m in a listening booth (LB-LS). In the listening booth, the simulation was additionally presented via headphones (LB-HP), and the measured BRIRs of the living room were also used for headphone reproduction (HATS). The third location was a three-dimensional spherical 86-channel loudspeaker array in an anechoic chamber for rendering the simulation (VR-LS; see [67] for more details). Again, the simulation was additionally presented via headphones while participants were sitting in the loudspeaker array (VR-HP).
For presentations with headphones, Sennheiser HD 650 were used, while set-ups with loudspeakers used Genelec 8030 CP. In the measurement in the listening booth, a RME Fireface UCX, in the living room a RME Fireface UFX, and in the VR lab, a RME Madiface XT audio interface drove loudspeakers and headphones at a sample rate of 44,100 Hz. In headphone-based conditions, head movements were not considered.
In the headphone-based Experiment 2, two anechoic standard spatial configurations with a frontal target and either a co-located interferer or an interferer separated by 90° relative to the listeners’ viewing direction (S0N0 and S0N90) were compared to the conditions in the living room, rendered using the measured BRIRs from Experiment 1. For the standard spatial configurations, measured HRIRs of the Cortex MK2 head and torso simulator were applied, which were obtained with the same measurement set-up as presented in Brinkmann et al. [68]. The stimuli were presented via Sennheiser HD 650 headphones. The measurements were carried out in a double-walled, sound-attenuating listening booth. Speech intelligibility was measured with the same procedure as in Experiment 1. Measurements were repeated for two interferers: (stationary) SSN and ISTSmale. In both Experiments 1 and 2, the participants performed the listening tests twice (test, retest) on two different days. Prior to starting the measurements, all listeners visited the living room lab to familiarize themselves with the environment and received training.
2.4 Listeners
Twelve normal hearing native German speakers (six females, six males; aged 19–31 years; median age of 25 years), who were not naïve to the OLSA, participated in Experiment 1. A different group of twelve normal hearing native German speakers (eight females, four males; aged 18–27 years; median age of 22 years), who were naïve to the OLSA matrix sentence test, participated in Experiment 2. The listeners had hearing thresholds for pure tones ≤20 dB HL at the frequencies 125, 250, 500, 750, 1000, 1500, 2000, 3000, 4000, 6000 and 8000 Hz.
3 Results
3.1 Experiment 1
The correlation coefficient between the results of test and retest was R = 0.926. Therefore, the average of both sessions is used in the following as individual SRT estimates.
Figure 2 shows the SRTs (SNR at 50% speech intelligibility) measured with the matrix sentence test (OLSA) in dB, with the different presentation modes on the abscissa. Results in the real living room lab with loudspeakers, which can be considered the ground truth, are shown on the left in black (LRL-LS). Loudspeaker-based reproductions of the simulated room acoustics are shown in shades of blue (LB-LS: listening booth with 4-channel loudspeaker array; VR-LS: virtual reality lab with 86-channel array). Headphone-based renderings are shown in shades of red in the right-hand part of the figure: HATS denotes the reproduction using measured BRIRs via headphones in the listening booth, while the remaining headphone renderings used the simulated room acoustics (LRL-HP: living room, LB-HP: listening booth, VR-HP: virtual reality lab).
Figure 2 Average SRTs and inter-individual standard deviations of all presentation methods for the three target positions STV, S5, S7 (top to bottom panel). The horizontal dotted line indicates an SRT of −15 dB as a reference for comparing across panels. The horizontal black bars and asterisks indicate significant differences between conditions.
Each panel of Figure 2 corresponds to a different (spatial) target position: the top panel shows results for target position STV (indicated by triangles), the middle panel for target position S5 (indicated by dots), and the bottom panel for the target position in the coupled room, S7 (indicated by squares). The interferer position was always S4, and the interferer signal was ISTSmale.
SRTs for LRL-LS are −16.4 dB for target position STV and −14.4 dB for S5. Although the long-term SNR, and therefore distance-related level differences, were removed from the data, SRTs are clearly higher (−12.9 dB) for target position S7. Here, no direct line of sight (or sound path) exists, so sound propagation occurs through the door opening. The mean SRTs for the different presentation methods deviate by approximately 1.8 dB from the SRT measured in the real room (LRL-LS). The simulated presentations tend to result in similar SRTs compared to LRL-LS for target positions STV and S7, and in lower SRTs for S5 (difference of 3.0 dB).
A two-way repeated measures ANOVA (target position × presentation mode) showed significant main effects of presentation mode [F(6, 66) = 14.95, p < 0.001] and target position [F(2, 22) > 100, p < 0.001], and a significant interaction [F(12, 123) = 21.16, p < 0.001]. Statistically significant post-hoc pairwise comparisons (with Bonferroni correction) between the presentation modes are indicated by the horizontal bars and asterisks (*p ≤ 0.05, **p ≤ 0.01, ***p ≤ 0.001) in Figure 2.
Regarding target position STV, differences to LRL-LS range from 0.2 dB for VR-LS to −2.5 dB for LB-LS, and significant differences mostly occur between LB-LS and other presentations. For target position S5, differences to LRL-LS range from 1.9 dB for HATS to 3.6 dB for VR-LS, and significant differences mostly occur between LRL-LS and other presentations. For target position S7, differences to LRL-LS range from 0.9 dB for LB-LS to −1.8 dB for HATS, and most significant differences occur between HATS and other presentations.
3.2 Experiment 2
In Experiment 2, SRTs were measured via headphone presentation for the two standard spatial conditions S0N0 and S0N90, and for the three living room target positions based on the measured BRIRs (HATS). The test-retest correlation was R = 0.964; therefore, the averages of the two sessions are shown. Figure 3 shows the mean SRTs over all subjects and the inter-individual standard deviations, with the (spatial) target positions on the abscissa. Results for the (fluctuating) ISTSmale interferer are in the left panel, whereas the right panel shows the results for the (stationary) SSN. Generally, the SRTs measured with ISTSmale are lower than those measured with SSN, except for target position S7. The standard deviations of the individual results tend to be larger for ISTSmale, likely due to its temporal fluctuations. The average SRT for target position S7 measured with the ISTSmale differs by about 0.9 dB from the result for HATS in Experiment 1, which is within the margin of a single standard deviation. For target positions STV and S5, the deviations are small (about 0.3 dB and about 0.1 dB, respectively). In Experiment 2, for both interfering signals, the spatial configuration with the highest SRT is S7, and the lowest SRT is measured for S0N90. The average SRT for S0N0 is lower compared to STV and S5 with ISTSmale, while it is similar to the echoic conditions with SSN.
Figure 3 Mean SRTs and inter-individual standard deviations for ISTSmale and SSN (left and right panel, respectively) at the three target positions and in the standard audiological configurations indicated on the x-axis: STV, S5, S7 and S0N0, S0N90 (from left to right).
A two-way repeated measures ANOVA showed significant main effects of source position [F(4, 44) = 210.55, p < 0.001] and noise type [F(1, 11) = 209.50, p < 0.001], and a significant interaction [F(4, 44) = 20.32, p < 0.001]. Significant differences were found between most conditions, except among STV, S5, and S0N0 for ISTSmale, and between STV and S0N0 for SSN, where the mean difference is 0.11 dB.
4 Technical evaluation
In the following, the acoustic conditions in the living room are evaluated based on measured room impulse responses. Furthermore, simulated room impulse responses are compared to the measurements, which will later be discussed in context with the results for speech intelligibility.
4.1 Binaural room impulse responses
In Figure 4, each panel shows the first 110 ms of the measured (upper traces; blue: left ear, red: right ear) and simulated BRIRs (lower traces) for the four different source positions S4, S5, S7, and STV. The early reflection patterns in the simulated BRIRs are generally similar to the measurements. The shoebox approximation in the room acoustics simulation, which neglects all interior objects and their surfaces, such as furniture, leads to some deviations in the initial part of the impulse response, most apparent for the STV source position (lower right panel). Here, several early reflections from furniture are visible in the measured BRIR: one from the glass surface of the coffee table at approximately 0.64 ms after the direct sound (which impinges at 7 ms in the BRIR), and one from the TV behind the loudspeaker at approximately 1.7 ms after the direct sound (at about 8 ms in the BRIR). These early reflections, which lead to an apparently more temporally diffused direct-sound peak in the measurement, are not present in the simulated BRIR. Also for S7, the primary reason for such deviations is the strongly simplified geometry in the simulation with two connected shoebox rooms, which disregards details such as the depth of the door opening, furniture, and the protrusion on the wall next to the door (see Fig. 1).
Figure 4 Comparison of measured and simulated BRIRs with normalized amplitudes for the four different positions in the living room lab. Measured BRIRs are shown in the top half of each panel, simulated BRIRs in the bottom half. The blue and red colors indicate the left and right ear, respectively.
Figure 5 shows the third-octave smoothed spectra of the measured (solid) and simulated BRIRs (dashed) for the four different source positions. In general, a good agreement between measured and simulated BRIRs can be observed. Due to the locations of the sound sources, the dominant energy at high frequencies arrives at the left ear for S4 and, conversely, at the right ear for S5. For STV, the long-term interaural level difference is very small, since the sound source is located at an azimuth angle of 0° relative to the view direction of the listener. For S7, less high-frequency energy impinges from the right, indicating that diffracted and reflected sound, traveling through the door opening from the kitchen, is dominant.
Figure 5 Third-octave smoothed spectra of the measured (solid) and the simulated BRIRs (dashed) for the four different source positions, showing the left (blue) and right ear (red). For better comparability, the simulated BRIRs were scaled to match the level of the measured BRIRs.
4.2 Reverberation time and early decay time
The reverberation time characterizes the temporal decay of energy in a room after the source of a stationary sound field is switched off. It was calculated from the energy decay curves of the measured BRIRs as T30, in octave bands between 250 Hz and 8 kHz.
The early decay time (EDT) is a decay-rate measure, describing the time in which the initial part of the energy decay curve decreases from 0 dB to −10 dB. During the early phase of the impulse response, direct sound and early reflections from nearby surfaces both contribute to perceived sound. These early reflections have been reported to improve speech intelligibility [69]. EDT and the reverberation time can differ, and EDT is considered to be a better measure for the subjective reverberance than the reverberation time [70]. According to Vorländer [71], the just noticeable differences (JNDs) for T30 and EDT are 5%.
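To make the two decay measures concrete, the following Python sketch computes T30 and EDT from a single-channel impulse response via Schroeder backward integration; the function names and the simple threshold-crossing evaluation are illustrative and are not the analysis code used in the study.

```python
import numpy as np

def decay_curve_db(ir):
    """Schroeder backward-integrated energy decay curve, normalized to 0 dB."""
    edc = np.cumsum(ir[::-1] ** 2)[::-1]
    return 10.0 * np.log10(edc / edc[0])

def decay_time(ir, fs, lo_db, hi_db, factor):
    """Time for the EDC to fall from lo_db to hi_db, scaled to 60 dB of decay."""
    edc = decay_curve_db(ir)
    i0 = np.argmax(edc <= lo_db)   # first sample at or below lo_db
    i1 = np.argmax(edc <= hi_db)   # first sample at or below hi_db
    return factor * (i1 - i0) / fs

def t30(ir, fs):
    """T30: decay from -5 to -35 dB, extrapolated to 60 dB (factor 2)."""
    return decay_time(ir, fs, -5.0, -35.0, 2.0)

def edt(ir, fs):
    """EDT: decay from 0 to -10 dB, extrapolated to 60 dB (factor 6)."""
    return decay_time(ir, fs, 0.0, -10.0, 6.0)
```

For band-wise values as in Figure 6, the impulse response would first be filtered into octave bands; for an ideal exponential decay, T30 and EDT coincide, whereas real rooms (and in particular coupled rooms such as S7) can show clearly different values.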
Figure 6 shows the EDT and T30 results for the four different source positions for the measured and simulated BRIRs. For S4, S5, and STV, the frequency-dependent T30 values are close to 0.5 s, with a tendency to gradually decrease to about 0.4 s above 3 kHz. T30 tends to be similar for both ears. For position S7, higher T30 values of about 0.7 s are observed, with a maximum of about 0.8 s at 3 kHz. The EDT (faint traces) tends to be shorter than T30 at target positions S4, S5, and STV. Only for S7, positioned in the coupled room, are the EDTs noticeably higher than T30, with a maximum of about 1 s at 3 kHz. Generally, a good agreement between the measured (bold solid) and simulated (bold dotted) T30 is observed. T30 was an input parameter for the simulation and should thus be matched when averaging over multiple locations. Accordingly, Table 1 (top row) shows a good agreement of the broadband measured and simulated T30, in line with the expected reverberation time of about 0.5 s in typical living rooms [57]. In contrast to T30, the EDT of the simulation was not directly controlled by the input parameters. Deviations between the measured (light solid) and simulated (light dotted) EDT can be observed for all positions, as the simulated BRIRs generally tend to have a shorter EDT, with the largest deviations for source position S7.
Figure 6 Reverberation time T30 (bold traces) and early decay time (EDT; faint traces) for both ears of the simulated and measured BRIRs as functions of octave band center frequency.
4.3 Clarity and direct to reverberant ratio
The clarity index C, introduced by Reichardt et al. [26], was intended to characterize the transparency of music in a concert hall. The clarity index for speech, C50, is defined as the ratio of the early energy of the impulse response (between 0 and 50 ms) to the late reverberant energy after 50 ms:

$$ C_{50} = 10 \log_{10} \left( \frac{\int_{0}^{50\,\mathrm{ms}} p^2(t)\,\mathrm{d}t}{\int_{50\,\mathrm{ms}}^{\infty} p^2(t)\,\mathrm{d}t} \right) \mathrm{dB}, $$

where p(t) is the room impulse response [72]. The higher the value of C50, the higher the impression of clarity. According to Bradley et al. [73], the JND for C50 is 1.1 dB.
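This early-to-late energy ratio translates directly into code. A minimal Python sketch, assuming the impulse response array starts at the direct sound (the function name is chosen here for illustration only):

```python
import numpy as np

def clarity_c50(ir, fs):
    """Clarity index C50 in dB: early (0-50 ms) vs. late (>50 ms) energy."""
    n_early = int(round(0.05 * fs))      # number of samples in the first 50 ms
    early = np.sum(ir[:n_early] ** 2)    # early energy
    late = np.sum(ir[n_early:] ** 2)     # late reverberant energy
    return 10.0 * np.log10(early / late)
```

For the frequency-dependent values in Figure 7, the same ratio would be evaluated on band-filtered impulse responses.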
Figure 7 shows the frequency-dependent C50 values for the four different source positions in the living room lab. As expected, the clarity values are lowest for target position S7 in the coupled kitchen, with values of about 0 dB. Across all target positions, clarity values tend to be higher at high frequencies, likely a consequence of faster decaying reverberant tails. This tendency is exaggerated by up to about 10 dB at 8 kHz in the simulated BRIRs for S4 and S5, which are otherwise in good agreement with the measurements at lower frequencies. For S7, the simulated BRIRs overestimate C50 by about 5 dB. For STV, C50 is underestimated by about 5 dB at most frequencies, likely because the shoebox approximation disregards early reflections from the coffee table and television in the real room. Depending on the source signal, the difference between measurements and simulations might thus surpass the JND.
Figure 7 C50 for measured (solid) and simulated (dotted) BRIRs for the left (blue) and right (red) ear.
Another measure relating the direct sound alone (instead of the early part of the RIR) to the reverberation is the direct-to-reverberant energy ratio (DRR). With increasing distance, the direct sound energy decreases while the level of the late reverberant field stays approximately constant. For large distances, the reverberant tail can mask parts of the direct sound and smear transients. Perceptually, the DRR serves as a distance cue [74]. The DRR was calculated as follows:

$$ \mathrm{DRR} = 10 \log_{10} \left( \frac{\int_{t_0 - t_w/2}^{t_0 + t_w/2} p^2(t)\,\mathrm{d}t}{\int_{t_0 + t_w/2}^{\infty} p^2(t)\,\mathrm{d}t} \right) \mathrm{dB}, $$

where p(t) is the room impulse response, t0 is the time of the peak corresponding to the arrival of the direct sound, and tw is the duration of a temporal window around t0 that contains only the direct sound. For all DRR values, the window length tw was manually determined and ranged from 1.2 ms to 5.2 ms.
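A minimal Python sketch of this DRR calculation, assuming the window is centred on the direct-sound peak (function name and windowing details are illustrative, not the study's implementation):

```python
import numpy as np

def drr_db(ir, fs, t0, tw):
    """Direct-to-reverberant ratio in dB.

    t0: sample index of the direct-sound peak; tw: window duration in
    seconds, assumed centred on t0 and containing only the direct sound."""
    half = int(round(tw * fs / 2))
    direct = np.sum(ir[max(t0 - half, 0):t0 + half] ** 2)  # direct-sound energy
    reverb = np.sum(ir[t0 + half:] ** 2)                   # reverberant energy
    return 10.0 * np.log10(direct / reverb)
```

In practice, as described above, the window length would be chosen per impulse response so that no early reflection falls inside the direct-sound window.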
In Table 1, DRRs are given for the three source positions S4, S5 and STV as measured with HATS and simulated. As no direct sound from the source position S7 reaches the receiver position, no DRR was estimated for this source position. In the measured BRIRs, the DRR at STV is similar at both ears with about −1.3 dB to −1.1 dB, whereas at S4 and S5, the DRR is considerably higher with about 5.1 dB in the ears that face the respective source position. At the ears facing away from the respective sources, the DRR is decreased to about −9 dB due to head shadowing. For STV, the DRR is about 2–3 dB larger in the simulation compared to the measurement in both ears, while for S5 and S4 it is about 2 dB higher in the left and right ear, respectively. With the DRR being 2 dB higher, this results in a difference of about 1 JND. According to Larsen et al. [75], the JND of DRR is approximately 2 dB at a DRR of 0 dB, between approximately 2 dB to 4 dB at 10 dB DRR and approximately 6 dB at −10 dB DRR.
The reason for the observed differences between simulation and measurement is two-fold: a comparison of the reverberant energy in the simulation revealed an overall 2 dB lower value compared to the measurement, which causes the overall higher DRR observed in the simulations. The underlying reason is that the reverberation time of the simulation was matched to that of the measurement, and the wall absorption coefficients for the simulation were derived from the reverberation time by inverting Sabine’s equation. This procedure typically matches the level of the reverberation, the so-called “reverberation strength”, only reasonably well, depending on the room. A second factor, which leads to the correct DRR of the simulation for the left and right ear of S4 and S5, respectively, is the approximately 2 dB larger interaural level difference of the direct sound in the dummy-head HRTFs at the opposite ear, which effectively “compensates” for the lower simulated diffuse late reverberation level. Taken together, the current differences of about 2 dB in the DRR originate from an “off-the-shelf” application of the current room acoustics model, without further parameter refinement, and from the slightly different HRTF dataset available for the simulations compared to that used in the measurement, for which the missing ear canals had to be compensated by an angle-independent (common) transfer function (see Sect. 2.2).
4.4 Speech transmission index and signal to noise ratio
The speech transmission index (STI; [23]) is based on the concept that speech can be viewed as an amplitude-modulated signal, in which the extent of modulation carries the speech information. It can characterize the impact of reverberation and noise on a target speech signal in a room. Houtgast and Steeneken [23] showed that the STI is closely related to speech intelligibility, with the STI ranging from 0 (bad) to 1 (excellent). STI values for the measured and simulated BRIRs for the four positions are shown in Table 1, calculated using the SoundZone Matlab Toolbox [76]. For the measured BRIRs, STV has the highest STI of about 0.83, which is closely matched by the simulation. For S5, the STI for the right ear is 0.82 for the measured BRIR, which is slightly overestimated by the simulation. The largest difference between measured and simulated BRIRs is observed for S7, where the measurement has an STI as low as about 0.62, whereas the simulation clearly deviates with an STI of about 0.72.
In addition to the STI, the long-term, broadband SNR was calculated with the SSN for the three different target positions and the interferer at position S4. The broadband SNR is shown in the bottom section of Table 1. Values at the better ear are about 1 dB for the measured BRIRs at S5 and STV. At S7, clearly lower values of about −10 dB SNR are observed for both measurement and simulation. In order to make the SNR values more interpretable in the context of the SRT results from Section 3, the same level adaptation procedure (application of the mean SNR for both ears of the measured BRIRs) has been applied, and the corresponding adapted SNRs are shown in the lowest rows of Table 1. The agreement of the SNRs for measured and simulated BRIRs is best for target position S7, while it is lower for STV, with a maximum deviation of 1.8 dB for the right channel of STV. The simulated BRIRs lead to lower SNRs for S5 and STV.
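The long-term broadband SNR of such a scene can be sketched by convolving each source signal with the corresponding (B)RIR channel and comparing total energies. The following minimal Python function is illustrative only and omits the level-adaptation step described above:

```python
import numpy as np

def broadband_snr_db(target_ir, interferer_ir, target_sig, interferer_sig):
    """Long-term broadband SNR (dB) at one ear: energy of the convolved
    target relative to the energy of the convolved interferer."""
    s = np.convolve(target_sig, target_ir)      # target as received at the ear
    n = np.convolve(interferer_sig, interferer_ir)  # interferer at the ear
    return 10.0 * np.log10(np.sum(s ** 2) / np.sum(n ** 2))
```

Evaluating this separately for the left and right ear gives the per-ear values reported in Table 1, from which a better-ear SNR can be read off.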
5 Discussion
The present study investigated speech intelligibility and room acoustic parameters in a real living room and with different modes of presentation, based on measured and simulated room acoustics, using loudspeakers and headphones. Additionally, speech intelligibility was compared to two standard audiological test conditions without reverberation, using co-located and spatially separated stationary and fluctuating interferers.
5.1 Speech intelligibility at different target positions in the living room
The living room lab enables acoustic and perceptual measurements in a real room resembling a typical home environment. Its acoustic properties were assessed based on measured dummy-head BRIRs. The broadband T30 (averaged across both ears) ranges between 0.48 s and 0.49 s for source positions S4, S5, and STV, which are located in the living room, and amounts to 0.73 s for target S7 in the adjoining room (see Tab. 1). These relatively short reverberation times are expected to hardly affect speech intelligibility compared to anechoic conditions [24]. S7, with occluded direct sound, is the most challenging target position, with the highest mean SRT for LRL-LS of −12.9 dB. Since the line of sight is interrupted, only diffracted and/or reflected sound reaches the receiver position through the door opening. Additionally, the door opening is located in a direction close to that of the interferer position S4, so only a marginal binaural benefit from spatial separation of target and interferer can be expected.
5.2 Real room and measured BRIRs
For the analysis of speech intelligibility in the real room, the measured BRIRs are required. The SNRs (see Tab. 1) at target position S7 are substantially lower compared to the other positions and are symmetrical for the measured BRIRs. The difference in SRT for LRL-LS between STV and S7 of about 3.5 dB cannot be explained purely by SNR (Tab. 1). Differences between S7 and the remaining target positions are found not only in the SRT values, but also in the room acoustic parameters: strong differences were observed between S7 and the two other target positions for the clarity C50 (Fig. 7), and a substantially higher EDT and T30 (Fig. 6) are observed for S7. The obstructed propagation of direct sound is also indicated by a lower STI of 0.6 in comparison to the two other target positions with values of approximately 0.8 (Tab. 1).
Haeussler et al. [77] reported a considerable decrease in speech intelligibility for room-in-room conditions (playing back a sound that was recorded in another room), compared to single rooms with the same T30 and the same total distance. They concluded that the spectral and temporal properties of the room-in-room condition differ from those of a conventional room. In our study, even though a clear effect of room acoustics was expected for target position S7, only a mild effect (approximately 3.5 dB) remains in the results after compensation of distance-related level and SNR differences.
For STV and S5, speech intelligibility is expected to be mainly affected by the spatial separation between target and interferer position. If the living room lab were anechoic, the larger spatial separation of interferer S4 and target S5 in comparison to STV should result in a lower SRT for S5, due to binaural unmasking (see [15]). However, this is not the case with loudspeaker presentation in the living room (LRL-LS) (see Fig. 2): the mean SRT for STV is −16.4 dB, while it is increased for S5 (−14.5 dB). These counterintuitive results can probably be attributed to the loudspeaker orientation: S5 does not point towards the listener (see also Fig. 1), leading to high-frequency attenuation in the direct sound, while STV is pointed at the listener. However, this effect is not reproduced in the simulation, hinting at additional useful early reflections that are present in the real room but not in the simplified shoebox representation. Additionally, a reason for the higher SRTs for S5 in the real room might be a slightly shifted loudspeaker orientation in comparison to the dummy-head measurements. For STV and S7, slight differences in orientation are less critical, given that STV directly pointed at the listener and no direct sound from S7 arrived at the listener position.
When comparing the SRTs obtained in the living room (“ground truth”, LRL-LS) with those obtained using the measured BRIRs presented over headphones (HATS), the best agreement was observed for target STV, with the mean SRT for LRL-LS being 0.45 dB lower than for HATS (see Fig. 2). For reference, the standard deviation between test lists of the matrix sentence test is 0.16 dB [64]. However, for targets S5 and S7 there were significant differences: in comparison to HATS, the SRT in LRL-LS was 1.93 dB lower for S5 and 1.75 dB higher for S7. Even though these differences are significant, they are within the standard deviation of both presentation modes, respectively. These differences between the SRTs obtained in LRL-LS and with measured BRIRs could be related to head movements in LRL-LS, individual HRTFs, and variations in the seating position of the listeners during the measurements in the real room.
5.3 Ground truth and simulation
Comparing the ground truth (LRL-LS) to the simulated loudspeaker-based presentations, the SRT differences vary across presentation modes for the three target positions (see Fig. 2): for STV, the SRT deviation between LRL-LS and VR-LS is as small as 0.15 dB, with VR-LS being lower. For S5, the largest difference is found, with a 3.64 dB higher SRT for LRL-LS than for VR-LS. For the small-scale loudspeaker array, LB-LS, the SRTs are 2.52 dB higher (STV), 2.78 dB lower (S5), and 0.90 dB lower (S7) compared to LRL-LS. This can be explained by the topology of the four-channel array LB-LS, where loudspeakers are placed at angles of 45° (see next paragraph). For target S7, LRL-LS is on average 1.05 dB higher than both simulated loudspeaker presentations. Overall, the average deviation between LRL-LS and the two loudspeaker-based presentations is 1.83 dB. The average deviation is largest for S5 (3.20 dB) and lower for STV and S7 (1.33 dB and 0.97 dB).
Comparing both loudspeaker-based presentations, there is only one significant difference between LB-LS and VR-LS, for target position STV. While the “diagonal” four-channel loudspeaker array can well reproduce diffuse sound fields for front-facing (0°) head orientations [67, 78], the target at STV is only reproduced as a phantom source between two loudspeakers (using the current VBAP mapping). Here, a fifth, frontally placed loudspeaker would be beneficial for a better reproduction with a small-scale loudspeaker array. Ahrens et al. [40] reported, for their loudspeaker-based measurements with spatially separated target and interferer, that their two simulated renderings showed significantly lower SRTs compared to the real reference room. However, this was not the case for co-located target and interferer. The lower SRTs for the spatially separated positions might have been related to errors in early reflections. In our study, the simulated presentations, which all use spatially separated sources, generally tend to lead to lower SRTs, with an average deviation of 1.58 dB compared to the real room and measured BRIRs (LRL-LS and HATS). Cubick et al. [30] reported a difference of 2 dB between the real room and the virtual sound environment with nearest-loudspeaker mapping, and a difference of 4.4 dB between the real room and the virtual sound environment with higher-order Ambisonics, for a target at 2 m distance.
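To illustrate why a frontal target ends up as a phantom source on the four-channel array, a minimal two-dimensional VBAP gain computation (following Pulkki's pairwise formulation) is sketched below; the function name and the energy normalisation are illustrative and do not reproduce the rendering software used in the study.

```python
import numpy as np

def vbap_2d_gains(source_az_deg, speaker_az_deg):
    """Energy-normalised gains for a phantom source between two loudspeakers.

    Angles in degrees; the loudspeaker pair is assumed to enclose the
    source direction (as for 0 deg between the +/-45 deg loudspeakers)."""
    def unit(az_deg):
        a = np.deg2rad(az_deg)
        return np.array([np.cos(a), np.sin(a)])
    L = np.column_stack([unit(a) for a in speaker_az_deg])  # speaker vectors
    g = np.linalg.solve(L, unit(source_az_deg))             # solve L @ g = p
    return g / np.linalg.norm(g)                            # unit energy
```

For a “diagonal” setup with loudspeakers at ±45°, a source at 0° receives equal gains on the two frontal loudspeakers, i.e. it is rendered as a phantom source rather than by a dedicated loudspeaker, whereas a source exactly at a loudspeaker direction is reproduced by that loudspeaker alone.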
In contrast to the current loudspeaker presentations, the headphone presentations did not allow for head movements. Here, HATS was based on real measurements, with a deviation from the “ground truth” LRL-LS discussed in Section 5.2. For STV, the largest SRT deviation between HATS and the other headphone presentations was 1.02 dB for LB-HP (see Fig. 2). For S5, the maximum is slightly larger, at 1.5 dB for VR-HP. The largest deviation is found for S7, where SRTs for HATS are 3.3 dB higher compared to VR-HP. To explain the deviations between HATS and the simulated headphone presentations, the acoustical properties of the measured (HATS) and simulated BRIRs (see Sect. 4) can be considered: for S5, the clarity index C50 of the simulation exceeds that of the measured BRIRs by more than the JND of 1.1 dB below 0.5 kHz and above 4 kHz, while the EDT differs by more than the JND of 5% below 0.25 kHz and above 2 kHz (see Figs. 6 and 7). This might cause the lower SRTs for the simulated BRIRs. For STV, C50 is lower for the simulated than for the measured BRIRs, which could explain the lower SRTs for HATS at STV. The lower SRTs (higher speech intelligibility) for simulated BRIRs do not agree with the STI measure, which is virtually identical for the simulated and measured BRIRs (see Tab. 1). For STV and S5, the SNR with level adaptation is lower for the simulated BRIRs compared to the measured ones, which is not reflected in the SRTs, as the simulation generally led to lower SRTs. Concerning the spatial separation, the average SRT over all presentation modes is about −17.0 dB for target position S5 (largest spatial separation of 120°) and −15.6 dB for STV (smaller spatial separation). Regarding the deviations in SRT for target position S7, the SNR (shown in Tab. 1) is generally in good agreement between measured and simulated BRIRs. Nevertheless, for the simulated BRIR of S7, the better ear has a 1.2 dB better SNR compared to the measured BRIRs, which might explain part of the lower SRTs observed for the simulation. The STI value for S7 is 0.1 higher for the simulation, also in line with the lower SRTs observed for the simulation (shown in Tab. 1). Similarly, the clarity index C50 shows higher values for the simulation, and the EDT is substantially lower for the simulated BRIRs (see Figs. 6 and 7). However, EDTs should be interpreted with caution for S7, where the BRIRs show a gradually increasing initial response, caused by the sound propagation from room to room (see [61]).
One potential source of deviations between the conditions using measured BRIRs (HATS) and the simulations are differences in the dummy head: in both conditions, non-individualized HRTFs were used, which might have affected speech intelligibility. The BRIRs were measured with the 45BB KEMAR head and torso with anthropometric pinnae (Type KB 5000/1) and ear simulators (RA 0234). For the simulated BRIRs, HRTFs from Braren and Fels [62] were available, which were measured with a 45BB-4 KEMAR head and torso. This KEMAR was equipped with different anthropometric ears (Type KB 0090 and KB 0091) and has microphones placed directly at the entrances of the ear canals [62]. Although an angle-independent (common) transfer function (see Sect. 2.2) was used to compensate for the deviations between the dummy heads, some direction-dependent differences likely remained.
5.4 Effect of rendering and measurement environment
Comparing loudspeaker and headphone presentations of the simulation, there are no significant differences for the majority of conditions. However, there are two significant differences for target position STV: one between VR-LS and LB-HP, where the SRT measured with LB-HP is 1.6 dB higher, and one between LB-LS and LRL-HP, where the SRT measured with LRL-HP is 1.7 dB lower. Nevertheless, both differences are within the intra-individual standard deviations.
Comparing the simulated (and identical) headphone-based renderings presented in the three different locations (living room lab, listening booth, and three-dimensional loudspeaker array), no statistically significant differences were observed. The maximum deviation was 0.6 dB, between LB-HP and LRL-HP for target position STV (see Fig. 2). Thus, the visual impression of the measurement environment did not affect the SRT, even in the conditions with the listeners seated in the real room, where they also performed the “ground truth” measurements. Similarly, Fichna et al. [79] showed no effect of the visual scene representation using virtual reality and a head-mounted display (HMD). Also in line with the current results, Ahrens and Lund [80] reported no effect of incongruent visual information provided with an HMD on the ability to analyze an auditory scene in terms of accuracy and response time. Regarding the perception of room acoustic properties, Schutte et al. [81] reported that virtual visual information does not influence the perceived reverberation. Ibrahim et al. [82] showed that visual information containing speech cues presented on a screen did not affect speech intelligibility scores for a living room, but did for a dinner party and a food court.
Taken together, the visual impression of the room had no effect on speech intelligibility in the current study, even if the listener was seated in the living room where the visual environment is congruent to the presented audio. In the current study, however, no talker and thus no lip movements were visible.
5.5 Relation to other audiological measurements
In “classical” speech audiometry, both co-located and spatially separated targets and interferers are typically employed under quasi-anechoic conditions, a setup also evaluated in Experiment 2 (see Fig. 3). Here, headphone-based speech intelligibility was measured using anechoic HRTFs for S0N0 and S0N90 and compared to the measured BRIRs (HATS) from the living room. SRTs were generally lower for the audiometric conditions than in the reproduced living room lab, particularly for the non-stationary ISTSmale interferer.
As expected, SRTs measured with the (stationary) SSN as an interfering sound source were considerably higher than with the fluctuating ISTSmale, where ‘listening in the dips’ is possible [14, 19] and parts of the target sentence can be perceived in quiet. Also in line with the literature, the intra-individual differences are larger for the fluctuating interferer [65, 83, 84], reflecting differences in dip-listening performance. A comparison of diotic presentations with SSN and ISTS was performed by Holube et al. [85] with younger and elderly listeners. For listeners younger than 30 years, using the same matrix sentence test procedure as here, they reported SRTs of about −22 dB with the original ISTS and about −8.5 dB with SSN, a difference of 13.5 dB, along with a much larger standard deviation for the ISTS interferer than for the SSN interferer. Their SRTs thus show a larger difference between ISTS and SSN than our results of −18.8 dB for the ISTSmale and −9.6 dB for SSN, a difference of 9.2 dB. One difference between the original ISTS and the ISTSmale used here is the gap durations: the ISTS in Holube et al. [85] has longer pause durations, with a mean of 188 ms (up to 650 ms), than the modified ISTSmale from Schubotz et al. [65], which has temporal gaps between and within sentences with a mean duration of 136 ms (up to 571 ms). Wagener et al. [84] investigated the role of silent intervals for sentence intelligibility in fluctuating noise in hearing-impaired listeners, using three speech-simulating fluctuating interferer versions based on “icra” noise [86], with pauses of up to 2 s (original), 250 ms, and 62.5 ms. They reported that SRTs increase as the pauses become shorter.
For the fluctuating interferer, the data obtained with measured BRIRs (HATS) from the living room can be compared between experiments. SRTs for STV and S5 deviated only slightly between Experiment 1 (STV: −16.0 dB, S5: −16.4 dB) and Experiment 2 (STV: −16.28 dB, S5: −16.5 dB), by 0.32 dB and 0.11 dB, respectively. Accordingly, the difference in mean SRTs between STV and S5 was small in both experiments: 0.43 dB in Experiment 1 and 0.22 dB in Experiment 2. A slightly larger deviation was found for S7, with the SRT from Experiment 1 being 1.0 dB higher than in Experiment 2.
Not only can the spatial separation of sources, in comparison to co-located sources, lead to considerable reductions in speech reception thresholds; the absence or presence of reverberation may also have an effect. It has been reported that the “binaural release from masking” is more prevalent in anechoic environments than in highly reverberant environments [14, 87, 88]. This is caused by two effects: the acoustic head shadow is reduced by the reflections, and the interferer signals are more decorrelated between the ears, which makes it more difficult to exploit ITD/IPD differences [89].
Concerning our results with SSN, the two target positions in the same room as the receiver (STV: −9.5 dB, S5: −11.6 dB) showed SRTs in the same range as the standard audiological spatial configuration S0N0 (−9.6 dB), although in the living room the interferer is placed at an angle of 45° and the target at either 0° (STV) or 45° (S5). With the ISTSmale interferer, lower SRTs were observed (STV: −16.3 dB, S5: −16.5 dB), while far lower SRTs were reached for S0N0 (−18.8 dB) and S0N90 (−24.1 dB). For both interferers, the anechoic S0N90 condition led to the lowest SRTs, which is due to binaural unmasking. In the presence of reverberation, dip listening is expected to be hampered and differences between fluctuating and stationary interferers are expected to decrease. Although the reverberation time of the current living room is relatively short, such a reduction can be observed for S5 and STV (with target and interferer in the same room).
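The binaural unmasking mentioned above can be quantified as spatial release from masking (SRM), i.e., the SRT improvement gained by separating the interferer from the target. A minimal sketch with the anechoic ISTSmale values from the text (the function name is ours):

```python
def spatial_release_db(srt_colocated_db, srt_separated_db):
    """Spatial release from masking in dB: positive when moving the
    interferer away from the target lowers (improves) the SRT."""
    return srt_colocated_db - srt_separated_db

# Anechoic values from the text for the ISTSmale interferer:
# S0N0 (co-located) -18.8 dB, S0N90 (interferer at 90 deg) -24.1 dB.
srm_istsmale = spatial_release_db(-18.8, -24.1)
print(round(srm_istsmale, 1))  # 5.3
```

In reverberant conditions this release is expected to shrink, for the decorrelation and head-shadow reasons given above.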
Speech intelligibility tests are often criticized for using low SNRs that are not representative of the real-world situations listeners encounter in everyday life; typically, environmental SNRs are above 0 dB [90, 91]. However, Weisser and Buchholz [92] note that the available surveys do not provide distances to the recorded interlocutor. In their study with a fixed distance of 1 m, negative SNRs were observed, and they report that SNRs improve significantly when listeners are allowed to adapt their distance. With the occluded target position S7 in the connected room presented here, it is possible to create more ecologically relevant, complex acoustic conditions that are more challenging than the standard S0N0 condition. If the distance is not accounted for, the SRT would be −3.26 dB for LRL-LS for the normal-hearing participants.
5.6 Limitations and perspectives
The simulated BRIRs utilized in the current study, based on simplified shoebox models for the two interconnected rooms and the sound propagation between them, exhibited an average SRT difference of approximately 1.77 dB compared to the ground truth (1.06 dB for STV, 2.97 dB for S5, and 1.3 dB for S7). In comparison, the overall standard deviation of the SRT measurements was 2.39 dB, and thus of similar size. Future simulations might be improved by using wall-specific absorption coefficients, which could particularly benefit S5, located close to one of the walls. Moreover, additional geometric details could be implemented, such as reflections from the TV screen and the glass table top, with solutions for computationally efficient modeling of edge diffraction [61, 93, 94].
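As a plausibility check of the averaging above, the per-position simulation-versus-measurement deviations can be pooled and set against the measurement variability (values taken from the text; variable names are ours):

```python
# SRT deviations (dB) of simulated vs. measured BRIRs per target position,
# and the overall SD of the SRT measurements, as quoted in the text.
deviations_db = {"STV": 1.06, "S5": 2.97, "S7": 1.3}
measurement_sd_db = 2.39

mean_dev_db = sum(deviations_db.values()) / len(deviations_db)
# mean_dev_db is about 1.78 dB (the text rounds to 1.77), i.e. below the
# measurement SD of 2.39 dB and thus of comparable size.
print(round(mean_dev_db, 2))
```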
In the current study, normal-hearing listeners participated with the focus on comparing real room acoustics with various reproductions and validating the reproduction methods and rendering across different loudspeaker arrays. Future research should involve hearing-impaired listeners to assess the impact of hearing loss and the benefits of hearing aids in both the real-world environment and its virtual reproductions.
6 Conclusions
A laboratory environment resembling a typical living room with a connected (kitchen) room has been established for hearing research. This living room lab enables reproducible measurements in realistic everyday conditions to be compared to virtual reproductions with various methods.
The current study compared “ground truth” measurements in the real room and different acoustic reproductions using a small-scale four-channel loudspeaker array, a three-dimensional large-scale 86-channel loudspeaker array, and headphones. Speech intelligibility was measured for three different target positions and a fixed interferer position in the living room lab, and additionally in two anechoic, standard audiological test conditions.
In the common home environment of a living room, with a relatively small volume and short reverberation time, a challenging, realistic communication situation occurs for a target talker located in an adjoining room without line of sight. With effects of long-term SNR compensated, the detrimental effect of room acoustics for such a condition in the current study (S7) was 3.5 dB in comparison to targets in the same room for the fluctuating interferer.
For the stationary noise interferer in Experiment 2, effects of room acoustics were largely negligible. Particularly for the frontal target located in the same room as interferer and receiver (STV), SRTs in the living room can be well represented by the classical audiological S0N0 condition.
For the comparison of the real room to the different reproduction methods in Experiment 1, a majority of the observed SRT differences were statistically significant, albeit relatively small. The SRTs in the living room lab were matched most closely by headphone reproduction using measured BRIRs, followed by the room simulations in the three-dimensional large-scale loudspeaker array, with differences ranging from 0.15 dB (for the frontal target STV) to 3.64 dB (for the lateral target S5). For the target in the coupled room (S7) and for STV (except with the small-scale loudspeaker array), SRTs in simulated headphone and loudspeaker-based reproductions matched those obtained in the living room within the observed standard deviation. SRT differences between conditions using measured and simulated BRIRs can at least partly be explained by differences observed in the room acoustical parameters. Considering the relatively small deviations from the real room, the simulated room appears generally suited for speech-in-noise testing.
For the lateral target (S5) and the target in the coupled room (S7), no significant difference between the small-scale loudspeaker array with four loudspeakers and the large-scale loudspeaker array was found. This indicates the potential for the applicability of a virtual living room in a clinical context using a small-scale loudspeaker array. For a frontal target, an additional frontal loudspeaker is recommended.
For simulation-based headphone presentations, no significant difference in speech intelligibility was observed between the real room and the other listening rooms.
Acknowledgments
The authors would like to thank Stefan Fichna for support with generating the room acoustics simulations.
Funding
This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – Projektnummer 352015383 – SFB 1330 C4 and C5.
Conflicts of interest
The authors declare no conflict of interest.
Data availability statement
An acoustic model of the living room environment and measured binaural room impulse responses of the living room lab are available on Zenodo: https://doi.org/10.5281/zenodo.5747753. Other research data associated with this article are available on request.
References
- G. Keidser, G. Naylor, D.S. Brungart, A. Caduff, J. Campos, S. Carlile, M.G. Carpenter, G. Grimm, V. Hohmann, I. Holube, S. Launer, T. Lunner, R. Mehra, F. Rapport, M. Slaney, K. Smeds: The quest for ecological validity in hearing science: what it is, why it matters, and how to advance it, Ear and Hearing 41, Supplement 1 (2020) 5S–19S. https://doi.org/10.1097/AUD.0000000000000944. [CrossRef] [PubMed] [Google Scholar]
- H. Fletcher: Auditory patterns, Reviews of Modern Physics 12, 1 (1940) 47–65. https://doi.org/10.1103/RevModPhys.12.47. [CrossRef] [Google Scholar]
- N.F. Viemeister: Temporal modulation transfer functions based upon modulation thresholds, Journal of the Acoustical Society of America 66, 5 (1979) 1364–1380. https://doi.org/10.1121/1.383531. [CrossRef] [PubMed] [Google Scholar]
- A.W. Mills: On the minimum audible angle, Journal of the Acoustical Society of America 30, 4 (1958) 237–246. https://doi.org/10.1121/1.1909553. [CrossRef] [Google Scholar]
- S.D. Ewert: Defining the proper stimulus and its ecology – mammals, in: B. Fritzsch (ed.), The senses: a comprehensive reference, 2nd edn, Elsevier, Oxford, 2020, pp. 187–206. https://doi.org/10.1016/B978-0-12-809324-5.24238-7. [CrossRef] [Google Scholar]
- M. Vorländer: Virtual acoustics: opportunities and limits of spatial sound reproduction, Archives of Acoustics 33, 4 (2008) 413–422. [Google Scholar]
- F. Pausch, J. Fels: Localization performance in a binaural real-time auralization system extended to research hearing aids, Trends in Hearing 24 (2020) 233121652090870. https://doi.org/10.1177/2331216520908704. [CrossRef] [Google Scholar]
- H. Steffens, M. Schutte, S.D. Ewert: Acoustically driven orientation and navigation in enclosed spaces, Journal of the Acoustical Society of America 152, 3 (2022) 1767–1782. https://doi.org/10.1121/10.0013702. [CrossRef] [PubMed] [Google Scholar]
- S. Serafin, A. Adjorlu, L.M. Percy-Smith: A review of virtual reality for individuals with hearing impairments, Multimodal Technologies and Interaction 7, 4 (2023) 36. https://doi.org/10.3390/mti7040036. [CrossRef] [Google Scholar]
- A. Neidhardt, C. Schneiderwind, F. Klein: Perceptual matching of room acoustics for auditory augmented reality in small rooms – literature review and theoretical framework, Trends in Hearing 26 (2022) 233121652210929. https://doi.org/10.1177/23312165221092919. [CrossRef] [Google Scholar]
- A. Weisser, J.M. Buchholz, C. Oreinos, J. Badajoz-Davila, J. Galloway, T. Beechey, G. Keidser: The ambisonic recordings of typical environments (ARTE) database, Acta Acustica united with Acustica 105, 4 (2019) 695–713. https://doi.org/10.3813/AAA.919349. [CrossRef] [Google Scholar]
- J. Peissig, B. Kollmeier: Directivity of binaural noise reduction in spatial multiple noise-source arrangements for normal and impaired listeners, Journal of the Acoustical Society of America 101, 3 (1997) 1660–1670. https://doi.org/10.1121/1.418150. [CrossRef] [PubMed] [Google Scholar]
- D.S. Brungart, N. Iyer: Better-ear glimpsing efficiency with symmetrically-placed interfering talkers, Journal of the Acoustical Society of America 132, 4 (2012) 2545–2556. https://doi.org/10.1121/1.4747005. [CrossRef] [PubMed] [Google Scholar]
- A.W. Bronkhorst: The cocktail party phenomenon: a review of research on speech intelligibility in multiple-talker conditions, Acta Acustica united with Acustica 86, 1 (2000) 117–128. [Google Scholar]
- T. Biberger, S.D. Ewert: The effect of room acoustical parameters on speech reception thresholds and spatial release from masking, Journal of the Acoustical Society of America 146, 4 (2019) 2188–2200. https://doi.org/10.1121/1.5126694. [CrossRef] [PubMed] [Google Scholar]
- J.F. Culling, M. Lavandier: Binaural unmasking and spatial release from masking, in: R.Y. Litovsky, M.J. Goupell, R.R. Fay, A.N. Popper (eds), Binaural hearing: with 93 illustrations, Springer International Publishing, Cham, 2021, pp. 209–241. https://doi.org/10.1007/978-3-030-57100-9_8. [Google Scholar]
- J. Kießling, B. Kollmeier, U. Baumann: Versorgung mit Hörgeräten und Hörimplantaten, 3., vollständig überarbeitete und erweiterte Auflage, Thieme Verlag, 2018. https://doi.org/10.1055/b-005-143661. [Google Scholar]
- E. Jorgensen, Y.-H. Wu: Effects of entropy in real-world noise on speech perception in listeners with normal hearing and hearing loss, Journal of the Acoustical Society of America 154, 6 (2023) 3627–3643. https://doi.org/10.1121/10.0022577. [CrossRef] [PubMed] [Google Scholar]
- J.M. Festen, R. Plomp: Effects of fluctuating noise and interfering speech on the speech-reception threshold for impaired and normal hearing, Journal of the Acoustical Society of America 88, 4 (1990) 1725–1736. https://doi.org/10.1121/1.400247. [CrossRef] [PubMed] [Google Scholar]
- J. Rennies, H. Schepker, I. Holube, B. Kollmeier: Listening effort and speech intelligibility in listening situations affected by noise and reverberation, Journal of the Acoustical Society of America 136, 5 (2014) 2642–2653. https://doi.org/10.1121/1.4897398. [CrossRef] [PubMed] [Google Scholar]
- I. Arweiler, J.M. Buchholz, T. Dau: Speech intelligibility enhancement by early reflections, Proceedings of the International Symposium on Auditory and Audiological Research 2 (2009) 289–298. [Google Scholar]
- T. Houtgast, H.J.M. Steeneken, R. Plomp: Predicting speech intelligibility in rooms from the modulation transfer function. I. General room acoustics, Acta Acustica united with Acustica 46, 1 (1980) 60–72. [Google Scholar]
- H.J.M. Steeneken, T. Houtgast: A physical method for measuring speech-transmission quality, Journal of the Acoustical Society of America 67, 1 (1980) 318–326. https://doi.org/10.1121/1.384464. [CrossRef] [PubMed] [Google Scholar]
- E. Brandewie, P. Zahorik: Prior listening in rooms improves speech intelligibility, Journal of the Acoustical Society of America 128, 1 (2010) 291–299. https://doi.org/10.1121/1.3436565. [CrossRef] [PubMed] [Google Scholar]
- M. Karjalainen, T. Peltonen: Estimation of modal decay parameters from noisy response measurements, Journal of the Audio Engineering Society 50 (2002) 11. [Google Scholar]
- W. Reichardt, O.A. Alim, W. Schmidt: Abhängigkeit der Grenzen zwischen brauchbarer und unbrauchbarer Durchsichtigkeit von der Art des Musikmotives, der Nachhallzeit und der Nachhalleinsatzzeit, Applied Acoustics 7, 4 (1974) 243–264. https://doi.org/10.1016/0003-682X(74)90033-4. [CrossRef] [Google Scholar]
- B.U. Seeber: Die SOFE-Hörumgebung für die audiologische Forschung–Aufbau und Ergebnisse aus der Anwendung, in: 17. Jahrestagung der Deutschen Gesellschaft für Audiologie, 2014. https://mediatum.ub.tum.de/doc/1222836/document.pdf. [Google Scholar]
- G. Grimm, S. Ewert, V. Hohmann: Evaluation of spatial audio reproduction schemes for application in hearing aid research, Acta Acustica united with Acustica 101, 4 (2015) 842–854. https://doi.org/10.3813/AAA.918878. [CrossRef] [Google Scholar]
- V. Best, G. Keidser, J.M. Buchholz, K. Freeston: An examination of speech reception thresholds measured in a simulated reverberant cafeteria environment, International Journal of Audiology 54, 10 (2015) 682–690. https://doi.org/10.3109/14992027.2015.1028656. [CrossRef] [PubMed] [Google Scholar]
- J. Cubick, T. Dau: Validation of a virtual sound environment system for testing hearing aids, Acta Acustica united with Acustica 102, 3 (2016) 547–557. https://doi.org/10.3813/AAA.918972. [CrossRef] [Google Scholar]
- J.F. Culling: Speech intelligibility in virtual restaurants, Journal of the Acoustical Society of America 140, 4 (2016) 2418–2426. https://doi.org/10.1121/1.4964401. [CrossRef] [PubMed] [Google Scholar]
- T. Weller, V. Best, J.M. Buchholz, T. Young: A method for assessing auditory spatial analysis in reverberant multitalker environments, Journal of the American Academy of Audiology 27, 7 (2016) 601–611. https://doi.org/10.3766/jaaa.15109. [CrossRef] [PubMed] [Google Scholar]
- A. Ahrens, M. Marschall, T. Dau: Measuring speech intelligibility with speech and noise interferers in a loudspeaker-based virtual sound environment, Journal of the Acoustical Society of America 141, 5 (2017) 3510–3510. https://doi.org/10.1121/1.4987360. [CrossRef] [Google Scholar]
- A. Westermann, J.M. Buchholz: The effect of nearby maskers on speech intelligibility in reverberant, multi-talker environments, Journal of the Acoustical Society of America 141, 3 (2017) 2214–2223. https://doi.org/10.1121/1.4979000. [CrossRef] [PubMed] [Google Scholar]
- V. Hohmann, R. Paluch, M. Krueger, M. Meis, G. Grimm: The virtual reality lab: realization and application of virtual sound environments, Ear and Hearing 41, Supplement 1 (2020) 31S–38S. https://doi.org/10.1097/AUD.0000000000000945. [CrossRef] [PubMed] [Google Scholar]
- M.J. Evans, A.I. Tew, J.A.S. Angus: Relative spatialization of ambisonic and transaural speech, in: Audio Engineering Society Convention 104, Audio Engineering Society, 1998. Available at https://www.aes.org/e-lib/browse.cfm?elib=8512 (accessed May 03, 2024). [Google Scholar]
- N. Mansour, M. Marschall, T. May, A. Westermann, T. Dau: Speech intelligibility in a realistic virtual sound environment, Journal of the Acoustical Society of America 149, 4 (2021) 2791–2801. https://doi.org/10.1121/10.0004779. [CrossRef] [PubMed] [Google Scholar]
- J.M. Buchholz, V. Best: Speech detection and localization in a reverberant multitalker environment by normal-hearing and hearing-impaired listeners, Journal of the Acoustical Society of America 147, 3 (2020) 1469–1477. https://doi.org/10.1121/10.0000844. [CrossRef] [PubMed] [Google Scholar]
- K. Kondo, T. Chiba, Y. Kitashima, N. Yano: Intelligibility comparison of Japanese speech with competing noise spatialized in real and virtual acoustic environments, Acoustical Science and Technology 31, 3 (2010) 231–238. https://doi.org/10.1250/ast.31.231. [CrossRef] [Google Scholar]
- A. Ahrens, M. Marschall, T. Dau: Measuring and modeling speech intelligibility in real and loudspeaker-based virtual sound environments, Hearing Research 377 (2019) 307–317. https://doi.org/10.1016/j.heares.2019.02.003. [CrossRef] [PubMed] [Google Scholar]
- M. Rychtáriková, T.V.D. Bogaert, G. Vermeir, J. Wouters: Perceptual validation of virtual room acoustics: Sound localisation and speech understanding, Applied Acoustics 72, 4 (2011) 196–204. https://doi.org/10.1016/j.apacoust.2010.11.012. [CrossRef] [Google Scholar]
- Ľ. Hládek, S.D. Ewert, B.U. Seeber: Communication conditions in virtual acoustic scenes in an underground station, in: 2021 Immersive and 3D Audio: from Architecture to Automotive (I3DA), Bologna, Italy, 8–10 September, IEEE, 2021, pp. 1–8. https://doi.org/10.1109/I3DA48870.2021.9610843. [Google Scholar]
- J. Schütze, S.D. Ewert, C. Kirsch, B. Kollmeier: Virtual acoustics and audiology: speech intelligibility in standard spatial configurations and in a living room, in: Proceedings of the 10th Convention of the European Acoustics Association Forum Acusticum 2023, Turin, Italy, 11–15 September, European Acoustics Association, 2024, pp. 3383–3387. https://doi.org/10.61782/fa.2023.1118. [Google Scholar]
- K.S. Helfer, L.A. Wilber: Hearing loss, aging, and speech perception in reverberation and noise, Journal of Speech, Language and Hearing Research 33, 1 (1990) 149–155. https://doi.org/10.1044/jshr.3301.149. [CrossRef] [Google Scholar]
- R. Plomp, A.M. Mimpen: Improving the reliability of testing the speech reception threshold for sentences, International Journal of Audiology 18, 1 (1979) 43–52. https://doi.org/10.3109/00206097909072618. [CrossRef] [Google Scholar]
- B. Hagerman: Sentences for testing speech intelligibility in noise, Scandinavian Audiology 11, 2 (1982) 79–87. https://doi.org/10.3109/01050398209076203. [CrossRef] [PubMed] [Google Scholar]
- A.W. Bronkhorst, R. Plomp: A clinical test for the assessment of binaural speech perception in noise, International Journal of Audiology 29, 5 (1990) 275–285. https://doi.org/10.3109/00206099009072858. [CrossRef] [Google Scholar]
- B. Kollmeier, M. Wesselkamp: Development and evaluation of a German sentence test for objective and subjective speech intelligibility assessment, Journal of the Acoustical Society of America 102, 4 (1997) 2412–2421. https://doi.org/10.1121/1.419624. [CrossRef] [PubMed] [Google Scholar]
- N.J. Versfeld, L. Daalder, J.M. Festen, T. Houtgast: Method for the selection of sentence materials for efficient measurement of the speech reception threshold, Journal of the Acoustical Society of America 107, 3 (2000) 1671–1684. https://doi.org/10.1121/1.428451. [CrossRef] [PubMed] [Google Scholar]
- A. Dietz, M. Buschermöhle, A.A. Aarnisalo, A. Vanhanen, T. Hyyrynen, O. Aaltonen, H. Löppönen, M.A. Zokoll, B. Kollmeier: The development and evaluation of the Finnish Matrix Sentence Test for speech intelligibility assessment, Acta Oto-laryngologica (Stockholm) 134, 7 (2014) 728–737. https://doi.org/10.3109/00016489.2014.898185. [CrossRef] [PubMed] [Google Scholar]
- M.T. Cord, R.K. Surr, B.E. Walden, O. Dyrlund: Relationship between laboratory measures of directional advantage and everyday success with directional microphone hearing aids, Journal of the American Academy of Audiology 15, 05 (2004) 353–364. https://doi.org/10.3766/jaaa.15.5.3. [CrossRef] [PubMed] [Google Scholar]
- K.M. Miles, G. Keidser, K. Freeston, T. Beechey, V. Best, J.M. Buchholz: Development of the everyday conversational sentences in noise test, Journal of the Acoustical Society of America 147, 3 (2020) 1562–1576. https://doi.org/10.1121/10.0000780. [CrossRef] [PubMed] [Google Scholar]
- S. van de Par, S.D. Ewert, L. Hladek, C. Kirsch, J. Schütze, J. Llorca-Bofí, G. Grimm, M.M. Hendrikse, B. Kollmeier, B.U. Seeber: Auditory-visual scenes for hearing research, Acta Acustica 6 (2022) 55. https://doi.org/10.1051/aacus/2022032. [CrossRef] [EDP Sciences] [Google Scholar]
- G. Grimm, M. Hendrikse, V. Hohmann: Pub environment, Zenodo, 2021. https://doi.org/10.5281/zenodo.5886987. [Google Scholar]
- L. Hladek, B.U. Seeber: Underground station environment, Zenodo, 2022. https://doi.org/10.5281/zenodo.6025631. [Google Scholar]
- F. Wolters, K. Smeds, E. Schmidt, E.K. Christensen, C. Norup: Common sound scenarios: a context-driven categorization of everyday sound environments for application in hearing-device research, Journal of the American Academy of Audiology 27, 07 (2016) 527–540. https://doi.org/10.3766/jaaa.15105. [CrossRef] [PubMed] [Google Scholar]
- C. Díaz, A. Pedrero: The reverberation time of furnished rooms in dwellings, Applied Acoustics 66, 8 (2005) 945–956. https://doi.org/10.1016/j.apacoust.2004.12.002. [CrossRef] [Google Scholar]
- M. Schulte, M. Vormann, M. Meis, K. Wagener, B. Kollmeier: Vergleich der Höranstrengung im Alltag und im Labor, in: 16. Jahrestagung der Deutschen Gesellschaft für Audiologie, 2013. [Google Scholar]
- J. Schütze, C. Kirsch, K.C. Wagener, B. Kollmeier, S.D. Ewert: Living room environment, Zenodo, 2021. https://doi.org/10.5281/zenodo.5747753. [Google Scholar]
- T. Wendt, S. van de Par, S. Ewert: A computationally-efficient and perceptually-plausible algorithm for binaural room impulse response simulation, Journal of the Audio Engineering Society 62, 11 (2014) 748–766. https://doi.org/10.17743/jaes.2014.0042. [CrossRef] [Google Scholar]
- C. Kirsch, T. Wendt, H. Hu, S.D. Ewert: Computationally-efficient simulation of late reverberation for inhomogeneous boundary conditions and coupled rooms, Journal of the Audio Engineering Society 71, 4 (2023). [Google Scholar]
- H.S. Braren, J. Fels: A high-resolution head-related transfer function data set and 3D-scan of KEMAR, RWTH Aachen University, 2020. https://doi.org/10.18154/RWTH-2020-11307. [Google Scholar]
- S.D. Ewert: AFC – a modular framework for running psychoacoustic experiments and computational perception models, in: Proceedings of the International Conference on Acoustics AIA-DAGA, Merano, 18–21 March, DEGA, 2013, pp. 1326–1329. [Google Scholar]
- K. Wagener, T. Brand, B. Kollmeier: Development and evaluation of a German sentence test part III: Evaluation of the Oldenburg sentence test, Zeitschrift Fur Audiologie 38 (1999) 86–95. [Google Scholar]
- W. Schubotz, T. Brand, B. Kollmeier, S.D. Ewert: Monaural speech intelligibility and detection in maskers with varying amounts of spectro-temporal speech features, Journal of the Acoustical Society of America 140, 1 (2016) 524. https://doi.org/10.1121/1.4955079. [CrossRef] [PubMed] [Google Scholar]
- I. Holube, S. Fredelake, M. Vlaming, B. Kollmeier: Development and analysis of an International Speech Test Signal (ISTS), International Journal of Audiology 49, 12 (2010) 891–903. https://doi.org/10.3109/14992027.2010.506889. [CrossRef] [PubMed] [Google Scholar]
- C. Kirsch, J. Poppitz, T. Wendt, S. van de Par, S.D. Ewert: Spatial resolution of late reverberation in virtual acoustic environments, Trends in Hearing 25 (2021) 233121652110549. https://doi.org/10.1177/23312165211054924. [CrossRef] [Google Scholar]
- F. Brinkmann, A. Lindau, S. Weinzierl, M. Müller-Trapet, R. Opdam, M. Vorländer: A high resolution and full-spherical head-related transfer function database for different head-above-torso orientations, Journal of the Audio Engineering Society 65, 10 (2017) 841–848. https://doi.org/10.17743/jaes.2017.0033. [CrossRef] [Google Scholar]
- J.S. Bradley, H. Sato, M. Picard: On the importance of early reflections for speech in rooms, Journal of the Acoustical Society of America 113, 6 (2003) 3233. https://doi.org/10.1121/1.1570439. [CrossRef] [PubMed] [Google Scholar]
- ISO 3382-1:2009: Acoustics – Measurement of room acoustic parameters – Part 1: Performance spaces. Available at https://www.iso.org/standard/40979.html (accessed December 30, 2022). [Google Scholar]
- M. Vorländer, H. Bietz, P.-T. Bundesanstalt: Comparison of methods for measuring reverberation time, Acustica 80 (1994) 205–215. [Google Scholar]
- B. Rakerd, E.J. Hunter, M. Berardi, P. Bottalico: Assessing the acoustic characteristics of rooms: a tutorial with examples, Perspectives of the ASHA Special Interest Group 3, 19 (2018) 8–24. https://doi.org/10.1044/persp3.SIG19.8. [CrossRef] [PubMed] [Google Scholar]
- J.S. Bradley, R. Reich, S.G. Norcross: A just noticeable difference in C50 for speech, Applied Acoustics 58 (1999) 99–108. [CrossRef] [Google Scholar]
- A.J. Kolarik, B.C.J. Moore, P. Zahorik, S. Cirstea, S. Pardhan: Auditory distance perception in humans: a review of cues, development, neuronal bases, and effects of sensory loss, Attention, Perception, & Psychophysics 78, 2 (2016) 373–395. https://doi.org/10.3758/s13414-015-1015-1. [CrossRef] [PubMed] [Google Scholar]
- E. Larsen, N. Iyer, C.R. Lansing, A.S. Feng: On the minimum audible difference in direct-to-reverberant energy ratio, Journal of the Acoustical Society of America 124, 1 (2008). [Google Scholar]
- J. Donley: jdonley/SoundZone_Tools. MATLAB, 2024. Available at https://github.com/jdonley/SoundZone_Tools. [Google Scholar]
- A. Haeussler, S. van de Par: Crispness, speech intelligibility, and coloration of reverberant recordings played back in another reverberant room (room-in-room), Journal of the Acoustical Society of America 145, 2 (2019) 931–944. https://doi.org/10.1121/1.5090103. [CrossRef] [PubMed] [Google Scholar]
- K. Hiyama, S. Komiyama, K. Hamasaki: The minimum number of loudspeakers and its arrangement for reproducing the spatial impression of diffuse sound field, in: Audio Engineering Society Convention 113, Audio Engineering Society, 2002. Available at https://www.aes.org/e-lib/online/browse.cfm?elib=11272 (accessed May 11, 2024). [Google Scholar]
- S. Fichna, T. Biberger, B.U. Seeber, S.D. Ewert: Effect of acoustic scene complexity and visual scene representation on auditory perception in virtual audio-visual environments, in: 2021 Immersive and 3D Audio: from Architecture to Automotive (I3DA), Bologna, Italy, 8–10 September, IEEE, 2021, pp. 1–9. https://doi.org/10.1109/I3DA48870.2021.9610916. [Google Scholar]
- A. Ahrens, K.D. Lund: Auditory spatial analysis in reverberant multi-talker environments with congruent and incongruent audio-visual room information, Journal of the Acoustical Society of America 152, 3 (2022) 1586–1594. https://doi.org/10.1121/10.0013991. [CrossRef] [PubMed] [Google Scholar]
- M. Schutte, S.D. Ewert, L. Wiegrebe: The percept of reverberation is not affected by visual room impression in virtual environments, Journal of the Acoustical Society of America 145, 3 (2019) EL229–EL235. https://doi.org/10.1121/1.5093642. [CrossRef] [PubMed] [Google Scholar]
- R. Ibrahim, K. Miles, R.P. Derleth, J.M. Buchholz: Visual speech benefit provided by realistic sentences in noise, Journal of the Acoustical Society of America 154, 4_supplement (2023) 152. https://doi.org/10.1121/10.0023092. [Google Scholar]
- S.D. Ewert, W. Schubotz, T. Brand, B. Kollmeier: Binaural masking release in symmetric listening conditions with spectro-temporally modulated maskers, Journal of the Acoustical Society of America 142, 1 (2017) 12–28. https://doi.org/10.1121/1.4990019. [CrossRef] [PubMed] [Google Scholar]
- K.C. Wagener, T. Brand, B. Kollmeier: The role of silent intervals for sentence intelligibility in fluctuating noise in hearing-impaired listeners: El papel de los intervalos de silencio para la inteligibilidad de frases en medio de ruido fluctuante en sujetos hipoacùsicos, International Journal of Audiology 45, 1 (2006) 26–33. https://doi.org/10.1080/14992020500243851. [CrossRef] [PubMed] [Google Scholar]
- I. Holube, S. Blab, K. Fürsen, S. Gürtler, K. Meisenbacher, D. Nguyen, S. Taesler: Einfluss des Maskierers und der Testmethode auf die Sprachverständlichkeitsschwelle von jüngeren und älteren Normalhörenden, Zeitschrift für Audiologie 48 (2009) 120–127. [Google Scholar]
- W.A. Dreschler, H. Verschuure, C. Ludvigsen, S. Westermann: ICRA noises: artificial noise signals with speech-like spectral and temporal properties for hearing instrument assessment. Ruidos ICRA: Señales de ruido artificial con espectro similar al habla y propiedades temporales para pruebas de instrumentos auditivos, Audiology, 2001. Available at https://www.tandfonline.com/doi/abs/10.3109/00206090109073110 (accessed May 11, 2024). [Google Scholar]
- R. Beutelmann, T. Brand: Prediction of speech intelligibility in spatial noise and reverberation for normal-hearing and hearing-impaired listeners, Journal of the Acoustical Society of America 120, 1 (2006) 331–342. https://doi.org/10.1121/1.2202888. [CrossRef] [PubMed] [Google Scholar]
- M. Lavandier, J.F. Culling: Speech segregation in rooms: monaural, binaural, and interacting effects of reverberation on target and interferer, Journal of the Acoustical Society of America 123, 4 (2008) 2237–2248. https://doi.org/10.1121/1.2871943. [CrossRef] [PubMed] [Google Scholar]
- J. Rennies, G. Kidd: Benefit of binaural listening as revealed by speech intelligibility and listening effort, Journal of the Acoustical Society of America 144, 4 (2018) 2147–2159. https://doi.org/10.1121/1.5057114. [CrossRef] [PubMed] [Google Scholar]
- K.S. Pearsons, R.L. Bennett, S.A. Fidell: Speech levels in various noise environments, Office of Health and Ecological Effects, Office of Research and Development, U.S. EPA, 1977. [Google Scholar]
- K. Smeds, F. Wolters, M. Rung: Estimation of signal-to-noise ratios in realistic sound scenarios, Journal of the American Academy of Audiology 26, 2 (2015) 183–196. https://doi.org/10.3766/jaaa.26.2.7. [CrossRef] [PubMed] [Google Scholar]
- A. Weisser, J.M. Buchholz: Conversational speech levels and signal-to-noise ratios in realistic acoustic conditions, Journal of the Acoustical Society of America 145, 1 (2019) 349–360. https://doi.org/10.1121/1.5087567. [CrossRef] [PubMed] [Google Scholar]
- C. Kirsch, S.D. Ewert: Filter-based first- and higher-order diffraction modeling for geometrical acoustics, Acta Acustica, 8 (2024) 73. https://doi.org/10.1051/aacus/2024059. [CrossRef] [EDP Sciences] [Google Scholar]
- C. Kirsch, S.D. Ewert: Binaural effects and rendering of edge diffraction in geometrical acoustics, Journal of the Acoustical Society of America 154, 4_supplement (2023) A28. https://doi.org/10.1121/10.0023528. [Google Scholar]
Cite this article as: Schütze J., Kirsch C., Kollmeier B. & Ewert S.D. 2025. Comparison of speech intelligibility in a real and virtual living room using loudspeaker and headphone presentations. Acta Acustica, 9, 6. https://doi.org/10.1051/aacus/2024068.
All Tables
Overview of broadband room acoustic parameters: reverberation time (T30), direct-to-reverberant ratio (DRR), signal-to-noise ratio (SNR), and speech transmission index (STI). T30 and DRR are shown for the measured and simulated BRIRs for the four different source positions in the living room lab. The STI is given for the measured and simulated BRIRs for the three different target positions in the living room lab (for both channels). Broadband SNRs are given for the three different target positions for measured and simulated BRIRs (left and right channels shown individually). SNRs in the bottom two rows (indicated with an asterisk) include the level adaptation.
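Two of the tabulated broadband measures, DRR and C50, are early-to-late energy ratios of the room impulse response. The following is a minimal single-channel sketch, not the authors' processing pipeline; the function names and the 2.5 ms direct-sound window for the DRR are illustrative assumptions:

```python
import numpy as np

def energy_ratio_db(ir, fs, split_ms, t_direct=0.0):
    """Ratio (dB) of IR energy before vs. after a split point relative to the direct sound."""
    n0 = int(round(t_direct * fs))
    n_split = n0 + int(round(split_ms * 1e-3 * fs))
    early = np.sum(ir[n0:n_split] ** 2)
    late = np.sum(ir[n_split:] ** 2)
    return 10.0 * np.log10(early / late)

def c50(ir, fs, t_direct=0.0):
    """C50 (clarity index): early/late energy ratio with the split 50 ms after the direct sound."""
    return energy_ratio_db(ir, fs, 50.0, t_direct)

def drr(ir, fs, t_direct, win_ms=2.5):
    """DRR: energy in a short window around the direct-sound arrival vs. all later energy.
    The +/- 2.5 ms window is an assumption; in practice it depends on arrival detection."""
    n0 = int(round(t_direct * fs))
    half = int(round(win_ms * 1e-3 * fs))
    direct = np.sum(ir[max(n0 - half, 0):n0 + half] ** 2)
    reverb = np.sum(ir[n0 + half:] ** 2)
    return 10.0 * np.log10(direct / reverb)
```

For a binaural room impulse response, each of these would be evaluated per ear channel, as in the left/right columns of the table.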
All Figures
Figure 1 Top panel: floor plan of the living room lab at the University of Oldenburg. The coupled (kitchen) room is located on the left, connected to the living room by an open door. The receiver position R is indicated by a head symbol on the couch. The different sound source positions are indicated by loudspeaker symbols. The interferer is color coded red, and the target positions are purple. Dimensions are given in meters. The eye symbol in the bottom right corner indicates the position from which the photo shown in the bottom panel was taken.
Figure 2 Average SRTs and inter-individual standard deviations of all presentation methods for the three target positions STV, S5, S7 (top to bottom panel). The horizontal dotted line indicates an SRT of −15 dB as a reference for comparing across panels. The horizontal black bars and asterisks indicate significant differences between conditions.
Figure 3 Mean SRTs and inter-individual standard deviation for ISTSmale and SSN (left and right panel, respectively) at the three target positions and in the standard audiological configurations indicated on the x-axis: STV, S5, S7 and S0N0, S0N90 (from left to right).
Figure 4 Comparison of measured and simulated BRIRs with normalized amplitudes for the four different positions in the living room lab. Measured BRIRs are shown in the top half of each panel, simulated BRIRs in the bottom half. Blue and red indicate the left and right ear, respectively.
Figure 5 Third-octave smoothed spectra of the measured (solid) and the simulated BRIRs (dashed) for the four different source positions, showing the left (blue) and right ear (red). For better comparability, the simulated BRIRs were scaled to match the level of the measured BRIRs.
Figure 6 Reverberation time T30 (bold traces) and early decay time (EDT; faint traces) for both ears of the simulated and measured BRIRs as functions of octave band center frequencies.
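T30 and EDT as plotted here are conventionally read off the Schroeder backward-integrated energy decay curve: T30 from a line fit between −5 and −35 dB, EDT between 0 and −10 dB, both extrapolated to 60 dB of decay. A minimal broadband sketch (for the octave-band curves shown, the IR would first be band-filtered; function names are illustrative, not the authors' code):

```python
import numpy as np

def schroeder_db(ir):
    """Schroeder backward-integrated energy decay curve in dB, normalized to 0 dB at t = 0."""
    edc = np.cumsum(ir[::-1] ** 2)[::-1]
    return 10.0 * np.log10(edc / edc[0])

def decay_time(ir, fs, db_start, db_end, extrapolate_to=60.0):
    """Decay time from a line fit on the EDC between db_start and db_end (negative dB values),
    extrapolated to `extrapolate_to` dB of decay."""
    edc = schroeder_db(ir)
    t = np.arange(len(edc)) / fs
    mask = (edc <= db_start) & (edc >= db_end)
    slope, _ = np.polyfit(t[mask], edc[mask], 1)  # slope in dB per second (negative)
    return -extrapolate_to / slope

def t30(ir, fs):
    """T30: fit between -5 and -35 dB, extrapolated to 60 dB of decay."""
    return decay_time(ir, fs, -5.0, -35.0)

def edt(ir, fs):
    """EDT: fit between 0 and -10 dB, also extrapolated to 60 dB."""
    return decay_time(ir, fs, 0.0, -10.0)
```

For an ideal exponential decay, T30 and EDT coincide; in real coupled rooms like the living room lab, the EDC bends and the two measures diverge.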
Figure 7 C50 for measured (solid) and simulated (dotted) BRIRs for the left (blue) and right (red) ear.