Open Access
Acta Acustica, Volume 8 (2024), Article Number 17, 17 pages
Section: Hearing, Audiology and Psychoacoustics
DOI: https://doi.org/10.1051/aacus/2024009
Published online: 01 April 2024

© The Author(s), Published by EDP Sciences, 2024

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1 Introduction

In the field of hearing aid research, the reproduction of virtual acoustic environments via loudspeakers is becoming increasingly important to enable immersive and reproducible experimental designs [1, 2]. While two-dimensional (2D) horizontal playback systems (e.g., loudspeaker rings) are widely used, three-dimensional (3D) periphonic systems are available in only a few laboratories due to the larger number of loudspeakers required and the much higher complexity of the hardware. In particular, in the clinical context of hearing research and hearing aid provision, only limited laboratory complexity is usually available due to space limitations. Human spatial sensitivity is best in the horizontal plane. The same is true for hearing devices, which typically have a horizontal microphone array geometry. Furthermore, in typical conversational situations, the primary sound sources are approximately at ear level. It is therefore of interest how large the perceptual difference is between 2D horizontal and 3D periphonic rendering methods. If periphonic methods do not provide a sufficient perceptual benefit under the acoustic conditions typically used in hearing research, horizontal methods requiring a less complex hardware setup would suffice. In this study we compare 2D and 3D rendering methods in the context of hearing aid research and with loudspeaker systems of different complexity.

The potential advantage of periphonic rendering methods is that virtual sound sources can be positioned at any point within the 3D space, which may increase the level of perceived immersion of the virtual environment, and thus may lead to a perceptually more realistic acoustic scene [3]. However, it cannot be excluded that artifacts such as coloration occur above the aliasing frequency and become more pronounced with self-motion [4]. Recent studies on the physical and perceptual limitations of 2D multichannel rendering methods have found that localization accuracy improves with increasing Ambisonics order, and that low-order Ambisonics can lead to splitting of perceived localization [5, 6]. Huisman et al. [7] found that the differences in localization accuracy between higher orders of Ambisonics are small. Diffuse reverberation was found to reduce the perceptual effects of Higher Order Ambisonics (HOA) reproduction errors and increase the usable frequency range [8]. Grimm et al. [9] evaluated the influence of reproduction methods and hearing aid algorithms on hearing aid performance in virtual acoustic environments.

Only a few studies were conducted on the perception of periphonic sound reproduction techniques. Most of these studies investigated 3D microphone techniques [10], mixed-order techniques [11], or performed technical analyses [12]. Data on the influence of 3D rendering methods for multi-loudspeaker setups on the perception of simulated sound sources, however, are sparse.

The purpose of this study is to assess the benefit of periphonic playback methods over horizontal playback in terms of localization accuracy and perceptual ratings, and to compare these results with technical measures. Systems with 16, 29, and 45 loudspeakers were compared. Loudspeaker signals were generated using Nearest Speaker Selection (NSP), Vector-Base Amplitude Panning (VBAP) [13], and Higher Order Ambisonics (HOA) [14] with different decoding strategies and parameterizations. Absolute localization performance was assessed with a visual indicator controlled by the listener’s head. The minimum audible angle (MAA) [15] was measured with a 2-alternative forced-choice (2-AFC) experiment. Both absolute localization and MAA experiments can be used to assess the accuracy with which stimuli are localized by listeners. Absolute localization shows the spatial distribution of the accuracy of the playback system, while the MAA measures the discrimination threshold and thus the resolution. Discrimination thresholds measured by the MAA may also be influenced by coloration effects that are not measurable in absolute localization experiments. Perceived quality was evaluated in three different complex acoustic environments representing a concert situation, traffic on a street, and a read text (lecture situation).

The hypothesis is that, compared to horizontal rendering methods, periphonic rendering methods result in improved localization of elevated sound sources and improved immersion in complex virtual acoustic scenes, while preserving the spatial resolution that horizontal rendering methods provide for sources on the equatorial plane.

2 Methods

2.1 Rendering methods

This study evaluated three different rendering methods: Nearest Speaker Selection (NSP), Vector-Base Amplitude Panning (VBAP), and Higher Order Ambisonics (HOA), with different decoding strategies. These methods are described in the following sections.

2.1.1 Nearest speaker selection

The Nearest Speaker Selection (NSP) rendering method uses the loudspeaker with the minimum angular distance to a sound source for the reproduction of each source, similar to the approach used by Favrot and Buchholz [16] and Seeber et al. [17]. This results in minimal spectral artifacts, because each virtual sound source is reproduced using exactly one loudspeaker [18]. The spatial resolution is determined by the spatial density of the loudspeakers.
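As a minimal illustration (a Python/NumPy sketch, not the TASCAR implementation used in the study), NSP can be expressed as assigning the full gain to the loudspeaker whose direction has the smallest angular distance to the source direction:

```python
import numpy as np

def nsp_gains(src_dir, spk_dirs):
    """Nearest-speaker-selection gains.
    src_dir: unit vector (3,) towards the virtual source.
    spk_dirs: array (N, 3) of unit vectors towards the N loudspeakers."""
    # Angular distance between the source and each loudspeaker direction.
    ang = np.arccos(np.clip(spk_dirs @ src_dir, -1.0, 1.0))
    gains = np.zeros(len(spk_dirs))
    gains[np.argmin(ang)] = 1.0  # all signal energy goes to the closest loudspeaker
    return gains
```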

2.1.2 Vector-base amplitude panning

Vector-Base Amplitude Panning (VBAP) [13] uses the best-matching simplex (in the 2D case a pair of loudspeakers, in the 3D case a set of three loudspeakers) for the reproduction of a virtual sound source. The normalized positions of the three loudspeakers $\mathbf{l}_1$, $\mathbf{l}_2$, $\mathbf{l}_3$ form a vector base $\mathbf{L}$:

$$ \mathbf{L} = \begin{pmatrix} l_{1x} & l_{1y} & l_{1z} \\ l_{2x} & l_{2y} & l_{2z} \\ l_{3x} & l_{3y} & l_{3z} \end{pmatrix}. \qquad (1) $$

The normalized source vector $\mathbf{p}$ in the direction of the virtual sound source can then be expressed in the new (non-Cartesian) vector space as

$$ \mathbf{g} = \mathbf{p}\,\mathbf{L}^{-1}. \qquad (2) $$

The normalized source direction vector in the new vector space, $\widehat{\mathbf{g}}$, serves as driving weights for the three loudspeakers:

$$ \widehat{\mathbf{g}} = \frac{\mathbf{g}}{\lVert\mathbf{g}\rVert}. \qquad (3) $$

The weights of all other speakers are zero. The set of simplices is usually generated by finding a convex hull around the speaker layout.
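The following Python/NumPy sketch illustrates this gain computation under the assumption that the simplex list is already available from a convex-hull triangulation of the loudspeaker layout (illustrative only, not the implementation used in the study):

```python
import numpy as np

def vbap_gains(p, triplets, spk_dirs):
    """p: unit source direction (3,); spk_dirs: (N, 3) unit loudspeaker directions;
    triplets: iterable of index triples (simplices of the convex hull)."""
    gains = np.zeros(len(spk_dirs))
    for tri in triplets:
        L = spk_dirs[list(tri), :]       # rows l1, l2, l3, cf. Eq. (1)
        g = p @ np.linalg.inv(L)         # Eq. (2): g = p L^-1
        if np.all(g >= -1e-9):           # source lies inside this simplex
            gains[list(tri)] = g / np.linalg.norm(g)  # Eq. (3): normalize
            break
    return gains
```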

2.1.3 Higher order ambisonics

Higher Order Ambisonics (HOA) [14] uses cylindrical harmonics (for 2D reproduction) or spherical harmonics (for 3D reproduction) to calculate the driving weights for the loudspeakers. It consists of an encoding stage, which transforms a sound source signal into an Ambisonics representation, and a decoding stage, which converts the Ambisonics representation into signals to be reproduced via loudspeakers. The decoding stage is a multiplication of the signal vector in Ambisonics representation with a decoding matrix, which depends only on the positions of the loudspeakers. This matrix can be optimized to preserve the direction of the energy vector $\mathbf{r}_E$ by scaling the signal components of different Ambisonics orders; this is labeled the “$\max_{\mathbf{r}_E}$” decoder, as opposed to “basic” decoding. See reference [14] for a definition of the order gains.

For regular loudspeaker layouts the decoding matrix can be found analytically. For irregular loudspeaker arrangements this is not possible. In this study, we compared the Ambisonics mode-matching method using a Moore-Penrose pseudoinverse with All-Round Ambisonic Decoding (AllRAD) [19]. Both methods were implemented in C++ and shown to achieve numerically identical results to the MATLAB implementation in the Ambisonics Decoder Toolbox [20, 21].
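As an illustration, the mode-matching decoder can be sketched as the pseudoinverse of the re-encoding matrix built from the loudspeaker directions. The helper real_sh below is a hypothetical function returning the vector of real spherical harmonics up to the given order for one direction; it is not part of any specific library, and this sketch is not the C++ implementation used in the study.

```python
import numpy as np

def mode_matching_decoder(spk_az, spk_el, order, real_sh):
    """Mode-matching HOA decoder via Moore-Penrose pseudoinverse.
    spk_az, spk_el: loudspeaker azimuths/elevations in radians.
    real_sh(order, az, el): assumed helper returning the ((order+1)**2,) vector
    of real spherical harmonics for one direction."""
    # Re-encoding matrix: one column of spherical-harmonic coefficients per loudspeaker.
    Y = np.column_stack([real_sh(order, a, e) for a, e in zip(spk_az, spk_el)])
    # Decoding matrix D of shape (n_speakers, (order+1)**2);
    # loudspeaker signals are obtained as D @ ambisonic_signal_vector.
    return np.linalg.pinv(Y)
```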

In this study, 5th-order Ambisonics was used for 2D and 3D HOA rendering, which was the highest order for which the decoding matrix of the periphonic setup was not under-determined. The 2D rendering condition was synthesized in the 2D HOA domain using circular harmonics. The 3D HOA rendering was synthesized separately, using spherical harmonics.
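As background (standard Ambisonics relations, not figures taken from the paper’s tables), the number of Ambisonics components at order M is 2M + 1 in 2D and (M + 1)² in 3D; the decoding matrix of an N-loudspeaker layout becomes under-determined when the number of components exceeds N:

```latex
% Number of Ambisonics components at order M (standard relations)
N_{\mathrm{2D}} = 2M + 1, \qquad N_{\mathrm{3D}} = (M + 1)^2 .
% For M = 5: N_{2D} = 11 and N_{3D} = 36 \le 45 loudspeakers;
% M = 6 would give N_{3D} = 49 > 45, i.e. an under-determined decoding matrix.
```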

2.1.4 Reference rendering method

As a reference rendering method, a “minimal” condition was used for a subset of experiments. For this method, a single loudspeaker channel at 0° elevation and 11.25° azimuth was used.

2.2 Virtual acoustic environments

Different virtual acoustic environments were used for the perceptual analysis: Localization and MAA experiments were performed with a single virtual sound source whose position was dynamically controlled. Quality rating was performed in a simulated concert hall with early reflections and late reverberation, with 60 sound sources at the positions of a classical orchestra on a stage (labeled “concert”), and with a single sound source in the center of the stage (labeled “speech”). In addition, a virtual street environment with moving background sources and a conversation between four talkers was used (labeled “street”). The audio material in the “concert” environment was an excerpt from Beethoven’s Symphony No. 8 op. 93 [22]. The “speech” environment used a German-language audio book of the fairy tale Snow White [23]. The “street” environment was taken from [24]. Impulse responses and scene recordings of these environments are available at Zenodo [25].

Early reflections were simulated using a first-order image source model. Late reverberation was simulated using a simple feedback delay network in first-order Ambisonics. The feedback matrix was designed as suggested by [26]. The audio was first encoded into first-order Ambisonics. A rotation was applied to each reflection, with a different amount of rotation in each feedback path, resulting in a diffuse sound field. These first-order Ambisonics components were rendered separately from the image source model, using a mode-matching first-order Ambisonics decoder addressing all loudspeaker channels. The loudspeaker signals were then decorrelated before being added to the output of the image source model, to avoid spectral coloration effects during self-motion. The late reverberation rendering was independent of the rendering method (NSP, VBAP, HOA) and compensated only for the number of loudspeakers involved, so that the direct-to-reverberation ratio was equal across rendering methods and loudspeaker layouts.
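For illustration, the sketch below shows a minimal mono feedback delay network with an orthogonal (Hadamard) feedback matrix. It is deliberately simplified: the reverberation in the study operated on first-order Ambisonics signals with a per-path rotation and subsequent decorrelation, which are omitted here, and all delay lengths and gains are arbitrary example values.

```python
import numpy as np

def fdn_mono(x, delays=(1117, 1303, 1559, 1847), g=0.85):
    """Minimal mono feedback delay network (FDN) with four delay lines."""
    H = 0.5 * np.array([[1,  1,  1,  1],
                        [1, -1,  1, -1],
                        [1,  1, -1, -1],
                        [1, -1, -1,  1]])        # orthogonal (Hadamard) feedback matrix
    bufs = [np.zeros(d) for d in delays]          # circular delay-line buffers
    idx = [0] * len(delays)
    y = np.zeros(len(x))
    for n, xn in enumerate(x):
        outs = np.array([bufs[k][idx[k]] for k in range(len(delays))])  # delayed samples
        y[n] = outs.sum()                         # sum of delay-line outputs
        fb = g * (H @ outs)                       # feedback through the orthogonal matrix
        for k in range(len(delays)):
            bufs[k][idx[k]] = xn + fb[k]          # write input plus feedback
            idx[k] = (idx[k] + 1) % delays[k]
    return y
```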

2.3 Apparatus and simulation software

The experiments were conducted in a laboratory at the University of Oldenburg. It was an acoustically treated room with carpet on the floor. Inside this room, a cylindrical setup with a diameter of 3.5 m was used, which was constructed as a metal frame, the outside of which was covered with heavy dark fabric to reduce high frequency room reflections, noise and diffuse light inside the array. Loudspeakers and other equipment were mounted on this structure, which was covered on the inside by an acoustically transparent projection screen. Some of the loudspeakers at the entrance were directly visible, while others were partially visible through the fabric of the cylindrical projection surface.

The loudspeaker array used in the experiments consisted of 45 Genelec 8020 loudspeakers arranged in four rings of varying elevation plus a single loudspeaker at the top: On the first main ring, which was 1.6 m above floor level (0° elevation), 16 loudspeakers were placed every 22.5°, starting at 11.25° azimuth. The second main ring was at approximately −16° elevation and also consisted of 16 loudspeakers starting at 11.25° azimuth and spaced 22.5° apart. The choice of two main rings is based on constraints of the laboratory, allowing participants to be surrounded by a 16-channel ring at ear level in both standing and sitting positions. The lower ring at approximately −40° elevation consisted of six loudspeakers placed every 60°, starting at 30° azimuth. The upper ring at approximately 30° elevation also consisted of six loudspeakers spaced 60° apart, but starting at 0° azimuth. In addition, there was one loudspeaker at the top of the array (approximately 70° elevation, 25° azimuth). Two different 3D speaker layouts were tested, either with or without the ring at −16° elevation, resulting in a total of 45 and 29 speakers, respectively. Differences in loudspeaker distance were compensated by appropriate delays and gains. The gains of all loudspeakers were then matched using a measurement microphone at the listening position. The exact loudspeaker positions can be found in the Appendix, Table A1.
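A common way to implement such delay and gain compensation for unequal loudspeaker distances is sketched below (Python/NumPy); this is an illustration assuming free-field propagation and a 1/r amplitude law, not necessarily the exact calibration procedure used in the laboratory.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed value

def distance_compensation(radii):
    """radii: measured loudspeaker distances from the listening position in m.
    Returns per-loudspeaker delays (s) and linear gains that virtually move
    all loudspeakers onto the sphere of the most distant one."""
    radii = np.asarray(radii, dtype=float)
    r_ref = radii.max()
    delays = (r_ref - radii) / SPEED_OF_SOUND   # delay closer loudspeakers
    gains = radii / r_ref                       # attenuate closer loudspeakers (1/r law)
    return delays, gains
```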

The asymmetry between the upper and lower hemispheres, as well as the fact that no loudspeaker was placed at 0° azimuth, was due to limitations in the possible mounting positions on the metal frame. This is not a general limitation: in experiments where a physical loudspeaker is required at 0°, both the participant and the video projection can be rotated accordingly.

Participants were seated in the center of the setup. A mark on the floor indicated the center of the loudspeaker setup, and a chair for the participants was placed on a platform at this position so that ear level was approximately at the height of the first loudspeaker ring. The participants’ head position was on average 12.7 cm behind, 0.7 cm to the right of, and 1.1 cm (top of the head) below the center of the setup, pooled across the duration of the localization experiments and across participants. The standard deviation of the head position was 4.4 cm in the front-back direction, 2.5 cm in the left-right direction, and 4.3 cm in the up-down direction. During the experiments, the setup around the participants was darkened.

For visual feedback in localization experiments, a virtual spot light rendered in the Blender Game Engine [27, 28], virtually attached to the participant’s head, was displayed by three video projectors enabling a continuous field of view of 300° in azimuth and 2 m height. The projection screen was acoustically transparent and negligibly reflective (transmission loss −0.1 dB, reflection −32 dB) [29]. The head movements were tracked by a Qualisys infrared marker tracking system, consisting of six fixed cameras, and a crown equipped with infrared markers worn by the participants. A Behringer XTouch One MIDI controller, a hand-held push button, and a touchscreen were used to collect the participant’s responses.

The Toolbox for Acoustic Scene Creation and Rendering (TASCAR, [30, 31]) was used to create the acoustic stimuli, for the interaction between the motion sensor and the virtual spot light, and for data logging. The experimental control and the graphical user interface (GUI) of MAA measurements as well as quality perception experiments were implemented in MATLAB [32].

2.4 Technical performance measure

The technical performance measures are based on the magnitude $r_v$ and direction $\widehat{\mathbf{r}}_v$ of the velocity localization vector $\mathbf{r}_v$,

$$ r_v\,\widehat{\mathbf{r}}_v = \mathbf{r}_v = \frac{\sum_{i=1}^{N} G_i\,\widehat{\mathbf{u}}_i}{\sum_{i=1}^{N} G_i}, \qquad (4) $$

as well as the magnitude $r_E$ and direction $\widehat{\mathbf{r}}_E$ of the energy localization vector $\mathbf{r}_E$,

$$ r_E\,\widehat{\mathbf{r}}_E = \mathbf{r}_E = \frac{\sum_{i=1}^{N} |G_i|^2\,\widehat{\mathbf{u}}_i}{\sum_{i=1}^{N} |G_i|^2} \qquad (5) $$

as defined by [14, 33, 34], where $G_i$ are the real-valued loudspeaker gains of speaker $i$ for a virtual sound source at direction vector $\mathbf{p}_\mathrm{src}$, $\widehat{\mathbf{u}}_i$ is a unit vector in the direction of loudspeaker $i$, and $N$ is the number of loudspeakers.

These performance measures were applied to test their reliability in serving as a prediction tool. They were computed for all rendering methods, based on a 2D and a 3D sampling of virtual sound source positions $\mathbf{p}_\mathrm{src}$ with $|\mathbf{p}_\mathrm{src}| = 1$. The 2D sampling consisted of $P = 360$ points evenly distributed on a horizontal ring. The 3D sampling consisted of $P = 2432$ points evenly distributed on a sphere. An angular error was calculated as

$$ E_{\mathbf{r}_v} = \frac{1}{P}\sum_{i=1}^{P} \arccos\!\left(\mathbf{p}_{\mathrm{src},i}\cdot\widehat{\mathbf{r}}_{v,i}\right), \qquad (6) $$

and

$$ E_{\mathbf{r}_E} = \frac{1}{P}\sum_{i=1}^{P} \arccos\!\left(\mathbf{p}_{\mathrm{src},i}\cdot\widehat{\mathbf{r}}_{E,i}\right), \qquad (7) $$

respectively, resulting in the four overall performance measures $E_{\mathbf{r}_v,\mathrm{2D}}$, $E_{\mathbf{r}_v,\mathrm{3D}}$, $E_{\mathbf{r}_E,\mathrm{2D}}$ and $E_{\mathbf{r}_E,\mathrm{3D}}$. For a direct comparison with the localization performance, the median azimuth and elevation difference, respectively, between $\widehat{\mathbf{r}}_E$ and $\mathbf{p}_{\mathrm{src},i}$ across all positions $i$ used in the localization experiment (see Fig. 1) was calculated, labeled Δaz,med and Δel,med.
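The sketch below (Python/NumPy, illustrative only) shows how the localization vectors and the angular error of Equations (4)–(7) can be computed for a single virtual source direction from the loudspeaker gains:

```python
import numpy as np

def localization_vectors(G, spk_dirs):
    """G: real-valued loudspeaker gains (N,); spk_dirs: (N, 3) unit vectors.
    Returns the velocity vector r_v (Eq. 4) and the energy vector r_E (Eq. 5)."""
    r_v = (G[:, None] * spk_dirs).sum(axis=0) / G.sum()
    E = np.abs(G) ** 2
    r_E = (E[:, None] * spk_dirs).sum(axis=0) / E.sum()
    return r_v, r_E

def angular_error_deg(p_src, r):
    """Angle between the source direction p_src (unit vector) and the normalized
    localization vector r, as in Eqs. (6) and (7), in degrees."""
    r_hat = r / np.linalg.norm(r)
    return np.degrees(np.arccos(np.clip(p_src @ r_hat, -1.0, 1.0)))
```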

Figure 1

Positions of the virtual sound sources psrc in the localization experiment (black symbols). The white area indicates the projection screen, the loudspeaker positions are represented as gray dots. Black crosses denote source positions with 0° elevation, triangles elevation > 0°, and squares elevation < 0°. Please note that not all rendering methods used all loudspeakers. Only the frontal part of the setup, i.e. the area including the virtual sound source positions, is shown here. The setup also involved loudspeakers at positions that are not shown here.

2.5 Subjective performance measures

2.5.1 Source localization error

Subjective localization of virtual sources was evaluated in a setup with a single virtual sound source. The acoustic stimulus was the International Speech Test Signal (ISTS) [35] rendered at 17 different source positions, as visualized in Figure 1. The source positions were selected to cover combinations of a range of different azimuth and elevation angles. Of these source positions, nine were at the height of the main speaker ring on the equatorial plane (0° elevation), three positions were above 0° elevation, and five positions were below 0° elevation. An optical pointer was projected on the screen at the participant’s current head direction. The participants’ task was to direct this pointer to the perceived position of the virtual source and then to confirm their choice by pressing a button, similar to the procedure introduced by [36]. As visible in Figure 1, some of the virtual source positions were above or below the projection screen, so that the visual pointer became invisible there. Participants were instructed to point their head towards the sound source even when the visual pointer disappeared beyond the edge of the screen. After confirming the position, the next virtual source was activated and the participants repeated their task. With this method, the head was always facing the sound source, resulting in the greatest possible spatial sensitivity to changes in the binaural cues. Thus, the measurement tested the performance of the rendering system rather than human sound localization performance. This procedure was repeated for all virtual sources and a subset of all rendering methods in random order; see Table 1 for an overview.

Table 1

Overview of the rendering methods. The first column specifies the label used in this paper, the second column the rendering method and, in the case of HOA, also the decoder type. The column “Per.” indicates whether the rendering method used periphonic or horizontal reproduction. This column is followed by the number of loudspeakers N and an indication of whether the method was applied in the localization experiment (“Loc”), in the minimum audible angle experiment (“MAA”), and in the perceptual quality rating (“Rating”). An “X” indicates that the rendering method was used in the respective experiment; “(X)” indicates that data were taken from a previous study [37].

From the difference between the perceived localization and the positions of the virtual sound source, a localization error in azimuth and elevation was calculated, averaged across all presented source positions within the groups elevation 0°, elevation > 0°, and elevation < 0°. Positive and negative elevations were analyzed separately to account for differences in the resolution of the loudspeaker setups. Localization errors were computed in the same way for horizontal and periphonic rendering methods. Virtually elevated sources were also tested with horizontal rendering methods, since listener expectations may lead to a perception of elevation even when only loudspeakers on the horizontal plane are used. In addition, the measurement of vertical errors for horizontal-only reproduction serves as a reference for the measurements with periphonic rendering methods.
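As an illustration of this analysis step (a Python/NumPy sketch with hypothetical variable names, not the MATLAB analysis code used in the study), signed azimuth and elevation errors can be computed and averaged per elevation group as follows:

```python
import numpy as np

def grouped_localization_errors(az_resp, el_resp, az_src, el_src):
    """All angles in degrees; one entry per trial. Returns mean signed azimuth and
    elevation errors for the groups elevation = 0, > 0, and < 0 degrees."""
    az_resp, el_resp = np.asarray(az_resp, float), np.asarray(el_resp, float)
    az_src, el_src = np.asarray(az_src, float), np.asarray(el_src, float)
    d_az = (az_resp - az_src + 180.0) % 360.0 - 180.0   # wrap azimuth error to [-180, 180)
    d_el = el_resp - el_src
    groups = {"el = 0": el_src == 0, "el > 0": el_src > 0, "el < 0": el_src < 0}
    return {name: (d_az[m].mean(), d_el[m].mean())
            for name, m in groups.items() if m.any()}
```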

2.5.2 Spatial resolution

To evaluate the perceptual spatial resolution of the different rendering methods, the MAA [15] was measured in azimuth using a 2-AFC experiment. The stimulus was a pink noise burst of 500 ms duration, without ramps at the boundaries. Broadband noises such as pink noise are known to be well localizable and are frequently used in MAA experiments [38, 39]. Horizontal-only rendering methods were assessed at 0° elevation, periphonic rendering methods at 0° and −30° elevation. Each trial consisted of two sounds at different azimuth angles. After the presentation of both sounds, the participants were asked to indicate whether the stimulus had moved from right to left or from left to right. The azimuth angle of the reference interval was roved in the range of −22.5° to 11.25° to avoid effects of coloration differences when one of the positions coincided with a loudspeaker.
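A minimal sketch of how a single trial with a roved reference azimuth could be generated is given below (Python, illustrative only; the azimuth sign convention and the adaptive procedure used to estimate the threshold are not specified in the text and are assumptions):

```python
import numpy as np

def maa_trial(separation_deg, rng=np.random.default_rng()):
    """One 2-AFC MAA trial: the reference azimuth is roved in [-22.5, 11.25] degrees,
    the second interval is shifted by +/- separation_deg, and the correct answer is
    the sign of the azimuth change."""
    ref = rng.uniform(-22.5, 11.25)                     # roved reference azimuth
    step = separation_deg if rng.random() < 0.5 else -separation_deg
    az_first, az_second = ref, ref + step
    correct = "positive" if step > 0 else "negative"    # direction of azimuth change
    return az_first, az_second, correct
```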

2.5.3 Perceptual quality rating

The perception of different sound quality aspects was assessed using a collection of questions based on the spatial audio quality inventory (SAQI) [40]. The ratings were retrieved using a MATLAB GUI. A 5-point scale was used for input, with a label at the top and bottom. The participants could switch between eight rendering methods at any time; when a rendering method was selected, the current acoustic scene reproduced with the selected rendering method was faded in with a cross-fade of 500 ms duration. The order of rendering methods was re-randomized for each question, and assigned to letters A–H. For input, participants placed an arrow on the scale. Each rendering method had to be listened to for at least five seconds before a rating was possible. After all rendering methods had been rated, it was possible to move on to the next question.
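As an illustration of such a switch between rendering methods, the sketch below implements a simple equal-power cross-fade over a given duration (Python/NumPy; the actual fade shape used by the GUI is not specified in the text and is an assumption):

```python
import numpy as np

def crossfade(a, b, fs, dur=0.5):
    """Equal-power cross-fade of duration `dur` seconds from signal a to signal b.
    a, b: mono signals of equal length, aligned at the switch instant; fs: sample rate in Hz."""
    n = int(dur * fs)
    t = np.linspace(0.0, np.pi / 2.0, n)
    fade_out, fade_in = np.cos(t), np.sin(t)          # equal-power fade curves
    out = np.copy(b)
    out[:n] = a[:n] * fade_out + b[:n] * fade_in       # blend region at the start
    return out
```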

The perceptual quality rating was assessed in three different virtual acoustic environments (see Sect. 2.2). To familiarize the participants with the method and the items of the questionnaire, a training session in the “concert” environment was presented before the actual rating.

The following set of questions was selected, each item with the German version presented to the participants, and including the labels:

  1. How well can the locations of each sound source be estimated? (Wie gut lassen sich die Orte der einzelnen Schallquellen einschätzen?): very bad/very good (sehr schlecht / sehr gut).

  2. How much do you feel spatially present in the scene? (Wie sehr fühlen Sie sich in der Szene räumlich präsent?): little present/very present (wenig präsent / sehr präsent).

  3. How natural does the acoustic environment sound? (Wie natürlich klingt die akustische Umgebung?): unnatural/natural (unnatürlich / natürlich).

  4. How much does your experience of the virtual environment resemble that of a real environment? (Wie sehr gleicht Ihr Erleben der virtuellen Umgebung dem einer realen Umgebung?): not at all/completely (überhaupt nicht / vollständig).

  5. How extended are the sound sources? (Wie weit ausgedehnt sind die Schallquellen?): very wide/very little (sehr weit / sehr wenig).

2.6 Participants

Eleven self-reported normal-hearing listeners aged 20–38 years (7 female, 4 male) participated in the experiments.

2.7 Hypotheses and data analysis

To test the hypothesis that periphonic rendering methods result in improved localization of elevated sound sources and improved immersion in complex virtual acoustic scenes, while preserving the spatial resolution that horizontal rendering methods provide for sources on the equatorial plane, a number of more specific hypotheses can be formulated:

  • H1: For elevated sources, the vertical localization error with periphonic rendering methods is smaller than the vertical localization error with horizontal rendering methods.

  • H2: For sources on the equatorial plane, the horizontal localization error with periphonic methods is the same as the horizontal localization error with horizontal methods.

  • H3: The MAA with periphonic rendering methods is the same as the MAA with horizontal rendering methods, for sources on the equatorial plane.

  • H4: The perceptual quality rating of periphonic rendering methods is higher than the rating with horizontal rendering methods, at least for some items.

An overview of the rendering methods assessed to test the hypotheses is shown in Table 1, including information about the horizontal/periphonic characteristics of each method, the number of loudspeaker channels and which methods were investigated in which experiments.

All hypotheses were tested by performing a paired t-test for all pairs of periphonic and horizontal methods. The t-test was applied because separate comparisons between the data obtained for the individual rendering methods were required. Each hypothesis was thus tested with different data; since no simultaneous comparisons were conducted, no correction for multiple comparisons (e.g., Bonferroni) was necessary.
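For illustration, such a paired comparison between one periphonic and one horizontal rendering method could be run as follows (Python/SciPy sketch with hypothetical variable names; the `alternative` argument selects the one-sided tests used for H1; requires SciPy ≥ 1.6):

```python
from scipy import stats

def compare_methods(scores_periphonic, scores_horizontal, alternative="two-sided"):
    """Paired t-test between two rendering methods (one score per participant).
    Set alternative to "less" or "greater" for a one-sided test."""
    t, p = stats.ttest_rel(scores_periphonic, scores_horizontal,
                           alternative=alternative)
    return t, p

# Hypothetical example: vertical localization error magnitudes per participant.
# t, p = compare_methods(err_periphonic, err_horizontal, alternative="less")
```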

2.8 Comparison of technical and subjective measures

The relation between technical and subjective measures was evaluated by computing the correlation between the technical measures and the means of the subjective measures, for all rendering methods except “minimal”. Pearson’s correlation coefficient r was determined between the technical measures $E_{\mathbf{r}_v,\mathrm{2D}}$, $E_{\mathbf{r}_v,\mathrm{3D}}$, $E_{\mathbf{r}_E,\mathrm{2D}}$ and $E_{\mathbf{r}_E,\mathrm{3D}}$ and the subjective perceptual rating data of each experiment, and between Δaz,med and Δel,med and the localization error in azimuth and elevation, respectively. Furthermore, the p-values testing the significance of the correlations were determined.
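A minimal sketch of this correlation analysis (Python/SciPy, hypothetical variable names) is shown below; scipy.stats.pearsonr returns both the correlation coefficient and the two-sided p-value:

```python
import numpy as np
from scipy import stats

def correlate_measures(technical, subjective_means):
    """Pearson correlation between a technical error measure and the mean subjective
    measure, both given as one value per rendering method."""
    r, p = stats.pearsonr(np.asarray(technical), np.asarray(subjective_means))
    return r, p
```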

3 Results

3.1 Technical measures

The angular errors $E_{\mathbf{r}_v,\mathrm{2D}}$, $E_{\mathbf{r}_v,\mathrm{3D}}$, $E_{\mathbf{r}_E,\mathrm{2D}}$ and $E_{\mathbf{r}_E,\mathrm{3D}}$ for the tested rendering methods are shown in Table 2. These errors represent the mean angular distance between the velocity localization vector $\mathbf{r}_v$ or the energy localization vector $\mathbf{r}_E$ and virtual source positions sampled on a horizontal ring (2D) or on a sphere (3D), as introduced in Section 2.4. The “minimal” method with a single loudspeaker resulted in an error of 90° for all measures. Based on the 2D error measures, this was followed by “nsp45”. The lowest 2D error was achieved with the two “hoa2d” methods. $E_{\mathbf{r}_v,\mathrm{2D}}$ and $E_{\mathbf{r}_v,\mathrm{3D}}$ were both smallest for the “vbap45” rendering method. $E_{\mathbf{r}_E,\mathrm{3D}}$ was smallest when using the “hoa” rendering methods with AllRAD decoders.

Table 2

Angular errors $E_{\mathbf{r}_v,\mathrm{2D}}$, $E_{\mathbf{r}_v,\mathrm{3D}}$, $E_{\mathbf{r}_E,\mathrm{2D}}$ and $E_{\mathbf{r}_E,\mathrm{3D}}$ for the tested rendering methods, averaged across 360 points on a ring (2D) or 2432 points on a sphere (3D), respectively. The standard deviations are given in parentheses.

The median azimuth difference Δaz,med between rE and the source positions used in the localization test is shown in Table 3. This difference was generally below 1°, with the exception of the “minimal” rendering method and of “nsp45” for sources at ear level.

Table 3

Median azimuth difference Δaz,med between the energy localization vector rE and the source positions used in the localization test.

Similarly, the median elevation difference Δel,med between rE and the source positions used in the localization test is shown in Table 4. For sources at ear level, the difference was zero or nearly zero for all methods except “hoa29_allrad” and “hoa45_allrad”, where it was 1.8° and 3.1°, respectively. Sources above ear level showed a median difference of −20° for 2D rendering methods, which is exactly the negative of the median elevation of those sources. Similarly, the median difference for sources below ear level was 30° with these methods. The median elevation difference for all other rendering methods was small compared to human localization performance in elevation. The largest differences were found for “hoa29_allrad”, “hoa45_allrad”, and “nsp45”.

Table 4

Median elevation difference Δel,med between the energy localization vector rE and the source positions used in the localization test.

3.2 Subjective measures

3.2.1 Absolute sound source localization

Absolute sound source localization was assessed to test the hypotheses that periphonic rendering methods result in improved localization of elevated sound sources compared to horizontal rendering methods (H1), and that periphonic rendering methods maintain the localization accuracy provided by horizontal rendering methods for sources on the equatorial plane (H2). Therefore, the results of the subjective absolute localization experiments were analyzed separately for vertical localization (Fig. 2) and horizontal localization (Fig. 3).

Figure 2

Median and interquartile range of the vertical localization error, pooled across test participants and source positions. The black lines represent the negative median elevation of the tested virtual positions of the respective elevation group; values close to these lines indicate that the sources were localized on the equatorial plane.

Figure 3

Median and interquartile range of the horizontal localization error, pooled across test participants and source positions.

Figure 2 visualizes the vertical localization error for the tested rendering methods in three different categories. The three categories separate the data into target source positions on the equatorial plane (elevation 0°), above (elevation > 0°) and below (elevation < 0°). The vertical localization error denotes the difference between the virtual target elevation angle and the detected elevation angle.

For target sources on the equatorial plane, the vertical localization errors are close to 0° for all rendering methods. Looking at the interquartile range, there is a small bias towards localizing the sources slightly too high, caused by the fact that the average ear level was slightly below the loudspeakers. For sources above and below the equatorial plane, the interquartile ranges are larger than for sources on the equatorial plane. Target sources above the equatorial plane (elevation > 0°) were localized too low by the participants, resulting in negative vertical localization errors. This is true for both periphonic and horizontal rendering methods, but the errors are smaller for the periphonic methods. For the horizontal methods (“hoa2d” and “hoa2d_basic”), the median error is approximately equal to the negative median elevation of the tested virtual positions (visualized by the black bar). This means that the sources were localized approximately on the equatorial plane when a horizontal rendering method was used. Target sources below the equatorial plane (elevation < 0°) were localized too high, resulting in positive vertical localization errors. Again, the error magnitudes are larger for the horizontal rendering methods; overall, the error magnitudes are larger for target sources below the equatorial plane than for target sources above it.

The hypothesis that periphonic rendering methods lead to a smaller vertical localization error than horizontal rendering methods for elevated sources (H1) can be confirmed: A one-sided t-test revealed that the vertical localization error obtained with periphonic rendering methods had a smaller magnitude than the vertical localization error obtained with horizontal rendering methods. This was significant for all source positions with an elevation angle different from zero (p < 0.05 for all paired comparisons; see Table A2 in the Appendix for more details). However, the vertical localization error for elevated sources was not zero for any of the rendering methods, indicating a strong localization bias towards the equatorial plane.

Figure 3 shows the horizontal localization error, again separately for virtual target positions on, above, and below the equatorial plane. The horizontal localization error is the difference between the virtual target azimuth and the detected azimuth.

Regarding the horizontal localization error for sources on the equatorial plane, a two-sided t-test showed significant differences of the means for “hoa2d” and “hoa29_allrad” (t(98) = 3.1, p = 0.0024) and for “hoa2d” and “vbap45” (t(98) = 2.5, p = 0.013); however, the means differ only by up to 1.3°. All other combinations of horizontal and periphonic rendering methods did not differ significantly. More details of the statistical results are provided in Table A3 in the Appendix. Therefore, the hypothesis that the horizontal localization error with periphonic methods is the same as with horizontal methods for sources on the equatorial plane (H2) can be confirmed for most combinations of horizontal and periphonic rendering methods. For the remaining combinations, the difference was significant, but small compared to the overall localization performance. A localization bias of approximately 2° was found in the data for sources on the equatorial plane. Due to the choice of pointing method, the participants were always facing the source when making the localization decision, so that the judgment was made with the source near 0° azimuth and 0° elevation in the head-related coordinate system, where the spatial resolution is best.

3.2.2 Spatial resolution

The spatial resolution, quantified by the MAA, is shown in Figure 4 for elevation angles of 0° and −30°. The bars represent the interquartile range and the bold lines show the median value over all participants for each condition. For comparison, data for reproduction schemes with fewer than 45 loudspeakers (without the second main loudspeaker ring) from an earlier study [37] were added, which were measured using the same paradigm and a different group of ten self-reported normal-hearing listeners.

Figure 4

Median and interquartile range of the MAA in degrees. For comparison, the data for renderings with fewer than 45 channels are taken from an earlier study with a different group of ten normal-hearing participants [37].

For the statistical analysis, an unpaired t-test was applied to compare the means of the horizontal rendering methods with the means of the periphonic rendering methods. The resulting statistics are shown in Table A4 in the Appendix. The unpaired test was used to account for the different participant groups for the different rendering methods. For all combinations of horizontal and periphonic rendering methods, the MAA of sources on the equatorial plane did not differ significantly between methods at a significance level of 0.05, thus confirming hypothesis H3 that the MAA with periphonic rendering methods is the same as the MAA with horizontal rendering methods for sources on the equatorial plane. On this plane, the MAA had median values in the range of 2–3° for all rendering methods, with interquartile ranges between 1° and 4°.

For sources at −30° elevation the MAA was measured only with periphonic rendering methods, because a comparison between horizontal and periphonic rendering methods is not meaningful in this case. However, it can be seen that the MAA is slightly larger for sources below the equatorial plane than it is for sources on the equatorial plane. Furthermore, the interindividual spread is larger in this case, leading to larger interquartile ranges than for the measurements on the equatorial plane. The MAA at −30° had values between approximately 3.5° and 4.5° for HOA rendering methods, and a larger MAA of approximately 5.5° for “vbap45”.

3.2.3 Perceptual quality rating

From the perceptual data analysis, one participant who was detected as an outlier for the “minimal” rendering method in more than three combinations of item and environment was removed. For the analysis of the perceptual quality rating, Welch’s t-test was used, because the variances were unequal between comparisons.

The results of the statistical analysis are shown in Tables A5–A9 in Appendix for the respective questions. For each comparison between horizontal and periphonic rendering methods in each environment, the resulting t- and p-values are provided.

Figures 5–9 visualize the median and interquartile range of the ratings for each question. The labels on the y-axis span the range of the labels that were also presented in the experiments. The x-axis divides each figure into the three different environments. For each rendering method tested, the median value of the given responses is shown as a solid line, and the interquartile range is shown as a bar.

Figure 5

Median and interquartile range of the rating responses to question “How well can the locations of each sound source be estimated?”.

The rating data of perceived localization are shown in Figure 5. The statistical analysis shows that the perceived localization depends on the environment and the rendering method. In the “concert” environment, “hoa2d_basic” was rated higher than “hoa45_allrad” (t(16.31) = 1.9, p = 0.036) and higher than “vbap45” (t(17.91) = 2.0, p = 0.033). The anchor condition (“minimal” rendering method) was perceived as worse than all other methods. In the “speech” environment, the “minimal” condition performed similarly to or even better than the multi-channel methods; in the tested comparisons, no significant differences were found. The reason for this behavior could be that this environment has only one main sound source and is less complex than the other environments. In the “street” environment, the “minimal” condition was again perceived as worse than the other methods, comparable to the “concert” environment. Here, “hoa2d” was rated higher than “hoa45_allrad” (t(17.94) = 1.8, p = 0.042), and “hoa2d_basic” was rated lower than “hoa29_pinv” (t(15.69) = −1.9, p = 0.038). The remaining comparisons were not significant.

The data show that across the different environments, using periphonic rendering methods does not improve perceived localization compared to horizontal rendering methods.

The spatial presence rating (see Fig. 6) presents a less clear picture. In the “concert” environment, the spatial presence with “hoa2d” was rated higher than with “vbap45” (t(17.85) = 1.9, p = 0.036), but all other combinations did not show significant differences. In the “speech” environment, the spatial presence with “nsp45” was rated higher than with “hoa2d” (t(16.77) = −2.0, p = 0.031) or “hoa2d_basic” (t(16.89) = −2.1, p = 0.026). On the other hand, with “hoa45_allrad”, the spatial presence was rated lower than with “hoa2d” (t(18.0) = 2.5, p = 0.012) and “hoa2d_basic” (t(18.0) = 2.4, p = 0.015). In the “street” environment, no significant differences were found. The highest rating of spatial presence was achieved with the “hoa2d” rendering method in the “concert” environment.

Figure 6

Median and interquartile range of the rating responses to the question “How much do you feel spatially present in the scene?”.

Regarding the naturalness shown in Figure 7, in the “concert” environment “hoa2d” was rated as significantly more natural than “hoa29_pinv” (t(17.64) = 2.6, p = 0.01) and than “hoa45_allrad” (t(17.55) = 1.8, p = 0.048). All other comparisons of horizontal with periphonic rendering methods did not reveal a significant difference in this environment. In the “speech” environment, “hoa2d” was rated as less natural than all periphonic rendering methods (p < 0.05), and “hoa2d_basic” was rated as less natural than “nsp45” (t(14.63) = −2.8, p = 0.0073). In the “street” environment, “hoa2d_basic” was rated as more natural than “hoa45_allrad” (t(16.45) = 2.3, p = 0.019); the remaining comparisons in this environment did not yield significant differences. The most natural rating across all environments and rendering methods was found for “hoa2d” in the “concert” environment.

Figure 7

Median and interquartile range of rating responses to the question “How natural does the acoustic environment sound?”.

The rating related to a comparison between the virtual and the real world (see Fig. 8) shows a similar result to the rating of naturalness. In the “concert” environment the “hoa2d” rendering method was rated higher than “hoa45_allrad” (t(16.20) = 2.2, p = 0.023). In the “speech” environment, “nsp45” was rated higher than “hoa2d” (t(14.32) = −3.1, p = 0.004) and “hoa2d_basic” (t(17.24) = −2.9, p = 0.005), and “vbap45” was rated higher than “hoa2d” (t(15.53) = −2.1, p = 0.026). In the “street” environment, “hoa2d” was rated higher than “hoa45_allrad” (t(17.84) = 1.9, p = 0.04). Other comparisons in these environments did not lead to significant differences. Across all environments and rendering methods, “hoa2d” was rated highest in the “concert” and “street” environments, indicating that the experience of the virtual world almost completely resembled that of the corresponding real environment.

Figure 8

Median and interquartile range of the rating responses to the question “How much does your experience of the virtual environment resemble that of a real environment?”.

The source width rating is shown in Figure 9. In all three environments, the “minimal” rendering method led to the smallest source width, while the other rendering methods produced results with a wide spread and mostly in a similar range. In the “speech” environment, “hoa2d” was rated as having a significantly larger source width than “hoa45_allrad” (t(17.68) = −2.4, p = 0.013) and “vbap45” (t(16.02) = −2.6, p = 0.01). The remaining comparisons did not show significant results.

Figure 9

Median and interquartile range of the rating responses to the question “How extended are the sound sources?”.

Taking into account all environments and all perceptual questions, the “nsp45” rendering method performs well on average: it shows little variation between the different environments and leads to high overall ratings.

In summary, the perceptual rating results confirm the hypothesis that the perceptual rating of periphonic rendering methods is higher than that of horizontal rendering methods (H4) only in the scene with the lowest complexity. In all other conditions, the hypothesis must be rejected.

3.3 Comparison of technical and subjective measures

To evaluate the relation between technical and subjective measures, correlations between these measures were calculated. In the following tables, non-significant correlation values, i.e. with p ≥ 0.05, are written in parentheses.

3.3.1 Absolute sound source localization

The correlation analysis results for the absolute sound source localization experiment are shown in Table 5. For the elevation direction, the technical measure Δel,med, referring to the median elevation difference between rE and the source positions in the localization test, significantly predicted the listeners’ performance for sources above and below the equatorial plane, with correlation coefficients of 0.95 and 0.84, respectively. However, the performance in azimuth direction was not predicted by the technical measures.

Table 5

Pearson’s correlation coefficients r between Δel,med and subjective localization error in elevation (first row), and Δaz,med and subjective localization error in azimuth (second row). Correlation values in parentheses are not significant (p ≥ 0.05), significant correlation values are printed in bold.

3.3.2 Perceptual quality rating

To evaluate whether subjective quality perception can be predicted by technical measures, the correlation analysis was conducted for all perceptual quality tasks. Tables 6–9 show the correlation analysis results for the perceptual quality rating experiment with respect to the measures $E_{\mathbf{r}_v,\mathrm{2D}}$, $E_{\mathbf{r}_v,\mathrm{3D}}$, $E_{\mathbf{r}_E,\mathrm{2D}}$ and $E_{\mathbf{r}_E,\mathrm{3D}}$, respectively. Only a few comparisons lead to significant correlations, indicating that the technical measures are not particularly reliable predictors of perceptual quality ratings, with a few exceptions: $E_{\mathbf{r}_v,\mathrm{2D}}$ correlates significantly with the perception of localization (r = 0.81, p = 0.0276) and naturalness (r = 0.83, p = 0.0209) in the “speech” environment. $E_{\mathbf{r}_v,\mathrm{3D}}$ and $E_{\mathbf{r}_E,\mathrm{3D}}$ both correlate significantly with localization in the “concert” environment ($E_{\mathbf{r}_v,\mathrm{3D}}$: r = 0.87, p = 0.0113; $E_{\mathbf{r}_E,\mathrm{3D}}$: r = 0.84, p = 0.0182), with presence in the “street” environment ($E_{\mathbf{r}_v,\mathrm{3D}}$: r = 0.79, p = 0.0345; $E_{\mathbf{r}_E,\mathrm{3D}}$: r = 0.80, p = 0.0308), and with source width in the “speech” environment ($E_{\mathbf{r}_v,\mathrm{3D}}$: r = −0.80, p = 0.0323; $E_{\mathbf{r}_E,\mathrm{3D}}$: r = −0.80, p = 0.0296). $E_{\mathbf{r}_E,\mathrm{3D}}$ also shows a significant correlation with presence (r = 0.86, p = 0.0135) and immersion/realism (r = 0.83, p = 0.0216) in the “speech” environment.

Table 6

Pearson’s correlation coefficients r between the technical error measure $E_{\mathbf{r}_v,\mathrm{2D}}$ and the subjective data from the perceptual quality rating experiment. Correlation values in parentheses are not significant (p ≥ 0.05); significant correlation values are printed in bold. The first column, “question”, refers to the list of perceptual quality ratings introduced in Section 2.5.3.

Table 7

Pearson’s correlation coefficients r between the technical error measure $E_{\mathbf{r}_v,\mathrm{3D}}$ and the subjective data from the perceptual quality rating experiment. Correlation values in parentheses are not significant (p ≥ 0.05); significant correlation values are printed in bold. The first column, “question”, refers to the list of perceptual quality ratings introduced in Section 2.5.3.

Table 8

Pearson’s correlation coefficients r between the technical error measure $E_{\mathbf{r}_E,\mathrm{2D}}$ and the subjective data from the perceptual quality rating experiment. Correlation values in parentheses are not significant (p ≥ 0.05); significant correlation values are printed in bold. The first column, “question”, refers to the list of perceptual quality ratings introduced in Section 2.5.3.

Table 9

Pearson’s correlation coefficients r between the technical error measure $E_{\mathbf{r}_E,\mathrm{3D}}$ and the subjective data from the perceptual quality rating experiment. Correlation values in parentheses are not significant (p ≥ 0.05); significant correlation values are printed in bold. The first column, “question”, refers to the list of perceptual quality ratings introduced in Section 2.5.3.

4 Discussion

On the basis of previous studies, e.g. [41], an improvement in vertical localization was expected when using periphonic rather than horizontal playback methods, especially for broadband stimuli such as the speech stimulus used in this experiment. This raises the question of why only a small advantage in vertical localization performance was found in this study. One explanation could be the participants’ expectation that the stimuli were in the equatorial plane, e.g., due to residual visual cues from light shining through the projection screen or because the loudspeakers were clearly visible at eye level when entering the laboratory. The partial visibility of the loudspeakers may have biased the participants’ responses. Since this concerns both the 2D and 3D rendering conditions, it is assumed that the conclusions of the study are still valid. Another explanation could be that the visual display with a virtual head-mounted spotlight required an unnatural head position for the sound sources at the greatest distance from the equatorial plane. Using a rotating chair instead of a rigid chair could solve this problem in future experiments. Furthermore, some of the elevated sound sources were located above the upper edge of the projection surface, which may have resulted in a localization bias. A possible alternative to avoid this problem could be the use of a virtual reality headset for visual feedback (as in [42]). On the other hand, the use of virtual reality headsets has a number of disadvantages for hearing research applications: they affect the head-related transfer functions and are not compatible with wearing hearing aids during an experiment. There is also an influence on the movement behavior of users, which limits the ecological validity of experiments in hearing research aiming at natural user behavior [43]. Another possible explanation is that vertical localization relies mainly on cues at high frequencies [41, 44]. However, for all reproduction methods except “nsp45”, spatial aliasing is largest at high frequencies and can lead to blurred localization cues. For horizontal localization, low-frequency components also play a major role, which may explain the larger robustness of horizontal localization when periphonic rendering methods are used with sound sources off the equatorial plane. With “nsp45”, each virtual sound source is reproduced by a single physical source; therefore, no spectral effects of spatial aliasing occur, only an angular mismatch.

The comparison of layouts with 29 and 45 loudspeakers partially reveals a non-optimal layout choice of having two main loudspeaker rings. Both in terms of technical measures and perceived localization, layouts with 16 additional loudspeakers result in only a small benefit or, in some cases, a deterioration. However, a benefit would have been expected if a more regular 3D layout had been used.

For comparison of the vertical localization bias: in [45], the average elevation error for anechoic speech stimuli played back via headphones with individual head-related transfer functions was approximately 10°, with large interindividual differences. The vertical localization bias for periphonic rendering methods in the current study was in the range of 10°–30°. Part of the observed localization bias might therefore not be caused by limitations of the rendering methods, but might be inherent to human localization performance.

In the literature, the vertical localization error for sources in front of the subject is approximately 4° and rises up to 19° for more peripheral sources [36]. The vertical errors measured in the current study were in a similar range.

For the horizontal localization error for sources on the equatorial plane, a bias of approximately 2° was detected. A possible reason for this could be a mismatch between the audio and the video projection in the setup used. Furthermore, listeners may have been influenced by the fact that more virtual sources were presented on the left side than on the right side. In future studies, it might be beneficial to choose a more balanced design of virtual source positions to rule out an effect on localization accuracy. A previous localization study [46] reported an undershoot in localization accuracy for sources more than 30° to the side in the case of a non-curved screen, which improved when a curved screen was used. Since the projection screen in the current study was also curved, this should not be the determining factor for the observed bias.

Literature values of horizontal localization errors for sources in front of the subject are approximately 1–2° [36] and thus in a similar range to those found in the current study.

Below the equatorial plane, some of the periphonic rendering methods showed larger errors in horizontal localization than did the horizontal rendering methods. It should be noted that the elevated stimuli for the horizontal rendering methods were rendered on the equatorial plane, in contrast to the periphonic rendering methods where the stimuli were rendered with elevation. This could also have an effect on the perceptual quality ratings of these sources.

The spatial resolution measured with the MAA was slightly poorer than values reported in the literature: Mills [15] and Litovsky [47] report a human spatial resolution of about 1°. For sources on the equatorial plane, the MAA in the current study was 2–4°. For sources below the equatorial plane, where the loudspeaker density was lower, it increased to 3–7°.

The perceived localization ability depends on the acoustic environment. The “minimal” playback method, based on a single loudspeaker, was rated best in the “speech” environment with a single primary sound source. In the other two, more complex environments, the “minimal” method was rated as very poor for localizing the sound sources. The “nsp45” method, which uses only one loudspeaker per source, was not rated best in the concert situation. This could be due to insufficient spatial resolution, which leads to a lack of separation of the musical instruments. In the street scene, moving sources cause positional jumps with the NSP method, which could explain the large variance in the rating of this method in that scene. The spatial presence rating again shows a large effect of the acoustic environment: whereas in the concert environment the “hoa2d” renderer was rated best, it was “nsp45” in the “speech” condition, which differed only in the stimulus and the number of sources but used the same room acoustic properties. The rating of naturalness shows a similar picture; again, “hoa2d” was rated as most natural in the concert situation, whereas “nsp45” was rated as most natural in the “speech” situation. A possible explanation could be that NSP rendering preserves more cues contributing to speech intelligibility than other methods, leading to better performance in terms of speech understanding, as discussed by Grimm et al. [2] and Ahrens et al. [48]. The “speech” situation indicates a trend for periphonic rendering methods to be rated as more natural than horizontal methods. In summary, periphonic rendering methods provided the largest perceptual benefit in environments with low complexity.

Throughout all perceptual ratings, there were large differences between the ratings of the “speech” environment and the other two environments, especially for the “hoa2d” rendering. A possible reason for this is the large contrast in the characteristics of these scenes. The “speech” scene is of rather low complexity with only one stationary direct sound source, while the other two scenes are much more complex. Both other scenes contain a higher number of direct sound sources, and in the “street” scene, these sources are also dynamic. Therefore, in the “speech” scene the participants are probably more sensitive to small differences between the rendering methods, resulting in lower perceptual ratings.

In the “speech” scene, the best naturalness, spatial presence and realism was achieved with the “nsp45” rendering. As this rendering outperformed both HOA and VBAP renderings, it may be an indication that coloration artifacts and spatial blur induced by HOA rendering play an important role in perceptual quality ratings.

When comparing predicted and measured data, vertical localization performance was significantly predicted by the objective error measures for sources above and below the equatorial ring. However, horizontal localization performance and perceptual quality ratings were not reliably predicted, with only a few exceptions. Overall, these measures did not serve as a fully reliable predictor of human perception, so future studies could evaluate the prediction accuracy of other, more perceptually motivated models, e.g., auditory-model-based direction estimation [49].

5 Conclusions

In comparison to horizontal rendering methods, periphonic rendering methods showed slightly improved vertical localization, but a strong localization bias towards the equatorial plane remained. No negative effects on horizontal localization performance on the equatorial plane were observed.

Perceptual quality, especially naturalness, depended on the acoustic environment. Periphonic reproduction was rated as more natural only in a simple scene with a single primary sound source in a simulated lecture hall. In a concert environment, higher-order horizontal Ambisonics rendering with max-$\mathbf{r}_E$ decoding was rated higher than all other rendering methods. In the virtual street environment with moving sources, no clear advantage of either method was found.
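
For reference, in horizontal (2D) Ambisonics the max-$\mathbf{r}_E$ decoder applies per-order gain weights $a_m = \cos\!\left(\frac{m\pi}{2N+2}\right)$, $m = 0, \ldots, N$ (cf. Daniel [14]). The short Python sketch below illustrates these weights; the decoder order shown is chosen only for illustration and is not necessarily the order used for the “hoa2d” rendering:

    import numpy as np

    def max_re_weights_2d(order):
        # Per-order gain weights a_m = cos(m * pi / (2 * order + 2)), m = 0..order,
        # for max-rE decoding of horizontal (2D) Ambisonics (cf. Daniel [14])
        m = np.arange(order + 1)
        return np.cos(m * np.pi / (2 * order + 2))

    # Example: weights for a 7th-order horizontal decoder (order chosen only
    # for illustration; not necessarily the order used in this study)
    print(np.round(max_re_weights_2d(7), 3))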

Overall, it can be concluded that periphonic rendering methods offer a marginal advantage over horizontal rendering methods in simple environments; this advantage disappears in more complex environments. As virtual scenes are of increasing interest in hearing aid research, it is also of great interest how to choose an appropriate audio rendering setup and method. Especially in clinical applications, it is desirable to know whether a laboratory setup of reduced complexity leads to satisfactory results. The findings of the current study help to choose an appropriate rendering method for hearing aid research, depending on the given hearing task. The results show that the additional effort of setting up a periphonic loudspeaker system, compared to a horizontal setup, does not necessarily lead to a large difference in perception.

Funding

Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – Project-ID 352015383 – SFB 1330 B1/A5.

Conflict of interest

The authors declare that they have no conflicts of interest in relation to this article.

Data availability statement

Recordings of impulse responses and scenes in the virtual acoustic environments rendered with the different rendering methods are available at Zenodo, under reference [25].

Appendix

Table A1

Labels and coordinates (azimuth, elevation, and radius) of the loudspeakers and allocation to the rendering methods minimal, hoa2d, hoa2d_basic, hoa29_pinv, hoa29_allrad, nsp45, hoa45_allrad, and vbap45. If a loudspeaker was used for a rendering method, the respective field is marked with an “X”.

Table A2

Overview of one-sided t-test results for the vertical localization error towards the horizontal plane.

Table A3

Overview of t-test results for the horizontal localization error.

Table A4

Overview of t-test results of the MAA experiment.

Table A5

Overview of t-test results for the question “How well can the locations of each sound source be estimated?”. p-values are provided for the left and right tail of the statistics, respectively.

Table A6

Overview of t-test results for the question “How much do you feel spatially present in the scene?”. p-values are provided for the left and right tail of the statistics, respectively.

Table A7

Overview of t-test results for the question “How natural does the acoustic environment sound?”. p-values are provided for the left and right tail of the statistics, respectively.

Table A8

Overview of t-test results for the question “How much does your experience of the virtual environment resemble that of a real environment?”. p-values are provided for the left and right tail of the statistics, respectively.

Table A9

Overview of t-test results for the question “How extended are the sound sources?”. p-values are provided for the left and right tail of the statistics, respectively.

References

  1. J. Cubick, T. Dau: Validation of a virtual sound environment system for testing hearing aids. Acta Acustica united with Acustica 102, 3 (2016) 547–557. https://doi.org/10.3813/aaa.918972.
  2. G. Grimm, J. Luberadzka, V. Hohmann: Virtual acoustic environments for comprehensive evaluation of model-based hearing devices. International Journal of Audiology 57, sup3 (2016) S112–S117. https://doi.org/10.1080/14992027.2016.1247501.
  3. B. Kapralos, M.R. Jenkin, E. Milios: Virtual audio systems. Presence: Teleoperators and Virtual Environments 17, 6 (2008) 527–549. https://doi.org/10.1162/pres.17.6.527.
  4. V. Pulkki: Multichannel sound reproduction, in D. Havelock, S. Kuwano, M. Vorlaender (Eds.), Handbook of signal processing in acoustics, Springer, New York, NY, 2008, pp. 747–760. ISBN 978-0-387-30441-0. https://doi.org/10.1007/978-0-387-30441-0_38.
  5. S. Bertet, J. Daniel, E. Parizet, O. Warusfel: Investigation on localisation accuracy for first and higher order ambisonics reproduced sound sources. Acta Acustica united with Acustica 99, 4 (2013) 642–657. https://doi.org/10.3813/aaa.918643.
  6. H. Wierstorf, A. Raake, S. Spors: Assessing localization accuracy in sound field synthesis. Journal of the Acoustical Society of America 141, 2 (2017) 1111–1119. https://doi.org/10.1121/1.4976061.
  7. T. Huisman, A. Ahrens, E. MacDonald: Ambisonics sound source localization with varying amount of visual information in virtual reality. Frontiers in Virtual Reality 2 (2021) 722321. https://doi.org/10.3389/frvir.2021.722321.
  8. C. Oreinos, J.M. Buchholz: Objective analysis of ambisonics for hearing aid applications: effect of listener's head, room reverberation, and directional microphones. Journal of the Acoustical Society of America 137, 6 (2015) 3447–3465. https://doi.org/10.1121/1.4919330.
  9. G. Grimm, S. Ewert, V. Hohmann: Evaluation of spatial audio reproduction schemes for application in hearing aid research. Acta Acustica united with Acustica 101, 4 (2015) 842–854. https://doi.org/10.3813/aaa.918878.
  10. S. Bertet, J. Daniel, L. Gros, E. Parizet, O. Warusfel: Investigation of the perceived spatial resolution of higher order ambisonics sound fields: a subjective evaluation involving virtual and real 3D microphones, in Audio Engineering Society Conference: 30th International Conference: Intelligent Audio Environments, Saariselkä, Finland, March 15–17, 2007, Audio Engineering Society, pp. 217–225.
  11. S. Favrot, M. Marschall, J. Käsbach, J. Buchholz, T. Weller: Mixed-order ambisonics recording and playback for improving horizontal directionality, in 131st Audio Engineering Society Convention, New York, USA, October 20–23, 2011, Audio Engineering Society, pp. 641–647.
  12. P.N. Samarasinghe, M.A. Poletti, S.M.A. Salehin, T.D. Abhayapala, F.M. Fazi: 3D soundfield reproduction using higher order loudspeakers, in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May, 2013, IEEE, pp. 306–310. https://doi.org/10.1109/ICASSP.2013.6637658.
  13. V. Pulkki: Virtual sound source positioning using vector base amplitude panning. Journal of the Audio Engineering Society 45, 6 (1997) 456–466.
  14. J. Daniel: Représentation de champs acoustiques, application à la transmission et à la reproduction de scènes sonores complexes dans un contexte multimédia. PhD thesis, Université Pierre et Marie Curie (Paris VI), Paris, 2001.
  15. A.W. Mills: On the minimum audible angle. Journal of the Acoustical Society of America 30, 4 (1958) 237–246. https://doi.org/10.1121/1.1909553.
  16. S. Favrot, J.M. Buchholz: LoRA: A loudspeaker-based room auralization system. Acta Acustica united with Acustica 96, 2 (2010) 364–375. https://doi.org/10.3813/aaa.918285.
  17. B.U. Seeber, S. Kerber, E.R. Hafter: A system to simulate and reproduce audio-visual environments for spatial hearing research. Hearing Research 260 (2010) 1–10. https://doi.org/10.1016/j.heares.2009.11.004.
  18. S. Spors, H. Wierstorf, A. Raake, F. Melchior, M. Frank, F. Zotter: Spatial sound with loudspeakers and its perception: A review of the current state. Proceedings of the IEEE 101, 9 (2013) 1920–1938. https://doi.org/10.1109/jproc.2013.2264784.
  19. F. Zotter, M. Frank: All-round ambisonic panning and decoding. Journal of the Audio Engineering Society 60, 10 (2012) 807–820.
  20. A.J. Heller, E.M. Benjamin, R. Lee: A toolkit for the design of ambisonic decoders, in Linux Audio Conference, CCRMA, Stanford University, California, April 12–15, 2012. Available at http://www.academia.edu/download/30883409/18.pdf.
  21. A.J. Heller, E.M. Benjamin: The Ambisonic Decoder Toolbox: Extensions for partial-coverage loudspeaker arrays, in Linux Audio Conference, ZKM, Karlsruhe, Germany, May 1–4, 2014.
  22. C. Böhm, D. Ackermann, S. Weinzierl: A multi-channel anechoic orchestra recording of Beethoven’s Symphony No. 8 Op. 93. Journal of the Audio Engineering Society 68, 12 (2021) 977–984. https://doi.org/10.17743/jaes.2020.0056.
  23. J. Grimm, W. Grimm: Schneewittchen. Audiobook, spoken by Johannes Ackner, 1812. Available at https://www.vorleser.net/grimm_schneewittchen/hoerbuch.html (accessed 22 June 2018).
  24. M.M.E. Hendrikse, G. Llorach, V. Hohmann, G. Grimm: Movement and gaze behavior in virtual audiovisual listening environments resembling everyday life. Trends in Hearing 23 (2019) 2331216519872362. https://doi.org/10.1177/2331216519872362.
  25. M. Gerken, V. Hohmann, G. Grimm: Comparison of 2D and 3D multichannel audio rendering methods for hearing research applications using technical and perceptual measures – impulse responses and scene recordings. Zenodo, 2023. https://doi.org/10.5281/zenodo.10037482.
  26. D. Rocchesso, J. Smith: Circulant and elliptic feedback delay networks for artificial reverberation. IEEE Transactions on Speech and Audio Processing 5, 1 (1997) 51–63. https://doi.org/10.1109/89.554269.
  27. T. Roosendaal: The Official Blender Game Kit: interactive 3D for artists, No Starch Press, San Francisco, 2003.
  28. T. Roosendaal: Blender, version 2.79b, 2018. Available at https://download.blender.org/release/Blender2.79/.
  29. J. Heeren, G. Grimm, S. Ewert, V. Hohmann: Video screens for hearing research: transmittance and reflectance of professional and other fabrics. ArXiv preprint, 2023. https://doi.org/10.48550/ARXIV.2309.11430.
  30. G. Grimm, J. Luberadzka, V. Hohmann: A toolbox for rendering virtual acoustic environments in the context of audiology. Acta Acustica united with Acustica 105, 3 (2019) 566–578. https://doi.org/10.3813/aaa.919337.
  31. G. Grimm, T. Herzke: TASCAR version 0.225.1, 2022. Available at https://github.com/gisogrimm/tascar.
  32. MATLAB: Version 9.7.0 (R2019b). The MathWorks Inc., Natick, Massachusetts, 2019.
  33. M.A. Gerzon: General metatheory of auditory localisation, in Audio Engineering Society Convention 92, Audio Engineering Society, 1992.
  34. A.J. Heller, R. Lee, E.M. Benjamin: Is my decoder ambisonic?, in 125th Audio Engineering Society Convention, 2008, pp. 719–740.
  35. I. Holube, S. Fredelake, M. Vlaming, B. Kollmeier: Development and analysis of an international speech test signal (ISTS). International Journal of Audiology 49, 12 (2010) 891–903. https://doi.org/10.3109/14992027.2010.506889.
  36. J.C. Makous, J.C. Middlebrooks: Two-dimensional sound localization by human listeners. Journal of the Acoustical Society of America 87, 5 (1990) 2188–2200. https://doi.org/10.1121/1.399186.
  37. M. Gerken, G. Grimm, V. Hohmann: Evaluation of real-time implementation of 3D multichannel audio rendering methods, in DAGA 2020 – 46. Jahrestagung für Akustik, Hannover, 16–19 March, 2020.
  38. D.R. Perrott, S. Pacheco: Minimum audible angle thresholds for broadband noise as a function of the delay between the onset of the lead and lag signals. Journal of the Acoustical Society of America 85, 6 (1989) 2669–2672. https://doi.org/10.1121/1.397764.
  39. D.R. Perrott, K. Saberi: Minimum audible angle thresholds for sources varying in both elevation and azimuth. Journal of the Acoustical Society of America 87, 4 (1990) 1728–1731. https://doi.org/10.1121/1.399421.
  40. A. Lindau, V. Erbes, S. Lepa, H.-J. Maempel, F. Brinkman, S. Weinzierl: A spatial audio quality inventory (SAQI). Acta Acustica united with Acustica 100, 5 (2014) 984–994. https://doi.org/10.3813/aaa.918778.
  41. S.K. Roffler, R.A. Butler: Factors that influence the localization of sound in the vertical plane. Journal of the Acoustical Society of America 43, 6 (1968) 1255–1259. https://doi.org/10.1121/1.1910976.
  42. S. Fargeot, O. Derrien, G. Parseihian, M. Aramaki, R. Kronland-Martinet: Subjective evaluation of spatial distorsions induced by a sound source separation process, in EAA Spatial Audio Signal Processing Symposium, Paris, France, 6–7 September, 2019. https://doi.org/10.25836/SASP.2019.15.
  43. G. Llorach, M.M.E. Hendrikse, G. Grimm, V. Hohmann: Comparison of a head-mounted display and a curved screen in a multi-talker audiovisual listening task. ArXiv preprint, 2020. https://doi.org/10.48550/ARXIV.2004.01451.
  44. R.A. Butler, R.A. Humanski: Localization of sound in the vertical plane with and without high-frequency spectral cues. Perception & Psychophysics 51, 2 (1992) 182–186. https://doi.org/10.3758/bf03212242.
  45. D.R. Begault, E.M. Wenzel, M.R. Anderson: Direct comparison of the impact of head tracking, reverberation, and individualized head-related transfer functions on the spatial perception of a virtual speech source. Journal of the Audio Engineering Society 49, 10 (2001) 904–916.
  46. F. Winter, H. Wierstorf, S. Spors: Improvement of the reporting method for closed-loop human localization experiments, in 142nd Audio Engineering Society Convention, Berlin, Germany, May 20–23, 2017.
  47. R.Y. Litovsky: Developmental changes in the precedence effect: Estimates of minimum audible angle. Journal of the Acoustical Society of America 102, 3 (1997) 1739–1745. https://doi.org/10.1121/1.420106.
  48. A. Ahrens, M. Marschall, T. Dau: Measuring and modeling speech intelligibility in real and loudspeaker-based virtual sound environments. Hearing Research 377 (2019) 307–317. https://doi.org/10.1016/j.heares.2019.02.003.
  49. M. Dietz, S.D. Ewert, V. Hohmann: Auditory model based direction estimation of concurrent speakers from binaural signals. Speech Communication 53 (2011) 592–605. https://doi.org/10.1016/j.specom.2010.05.006.

Cite this article as: Gerken M., Hohmann V. & Grimm G. 2024. Comparison of 2D and 3D multichannel audio rendering methods for hearing research applications using technical and perceptual measures. Acta Acustica, 8, 17.
