On the improvement of accommodation to non-individual HRTFs via VR active learning and inclusion of a 3D room response

This study examines the efficiency of a training protocol using a virtual reality application designed to accelerate an individual's selection of, and accommodation to, non-individualized HRTF profiles. This training introduces three elements to hasten audio localization performance improvement: an interactive HRTF selection method, a parametric training program based on active learning, and a relatively dry room acoustic simulation designed to increase the quantity of spatial cues presented. Participants rapidly selected an HRTF (≈5 min) followed by training over three sessions of 12 min distributed over 5 days. To study the impact of the room acoustic component on localization performance evolution, participants were divided into two groups: one acting as control reference, training with only anechoic renderings, the other training in reverberant conditions. The efficiency of the training program was assessed across groups and the entire protocol was assessed through direct comparisons with results reported in previous studies. Results indicate that the proposed training program led to improved learning rates compared to those of previous studies, and that the included room response accelerated the learning process.


Introduction
Binaural synthesis is a signal processing technique used to render spatial auditory scenes over headphones. It relies on the application of direction-dependent audio cues to a monophonic signal, mimicking time and frequency transformations resulting from the propagation of an acoustic wave from a sound source to the listener's ear canals [1,2]. The technique is used to simulate 3D sounds in practically all wearable augmented and Virtual Reality (VR) systems today.
A recurrent problem in binaural synthesis is that any discrepancy between the simulation and the real phenomenon inevitably impacts the perceived auditory space. A typical discrepancy is the use of direction-dependent cues referred to as Head Related Transfer Functions (HRTFs) not measured on the user listening to the rendering. The resulting non-individualized synthesis is often the cause of degraded externalization or decreased localization accuracy [3,4].
Using non-individual HRTFs is common practice as measuring them on a per-user basis is not practical. Methods have been designed to, given a database of existing HRTFs, lead users to select one that minimizes a given criterion, e.g. localization errors [5]. Still, these perceptual best-match HRTFs generally result in less precise auditory space perception than individual HRTFs [6]. To further improve one's affinity to a given HRTF, training procedures have been proposed, showing that users could adapt to non-individual HRTFs, exhibiting localization performance approaching that of users relying on individual HRTFs [6,7].
Paired together, the HRTF selection and training procedures that have been proposed to date generally require users to spend more than an hour to achieve acceptable localization performance (see Sect. 2). The first objective of the present study is to introduce and evaluate a novel HRTF selection and training procedure, designed to accelerate the overall adaptation process. The second objective is to assess whether the addition of a room acoustic response can have a pronounced impact on auditory accommodation to non-individual HRTFs compared to anechoic conditions, given the additional information provided.
A review of previous work is presented in Section 2. The exact scope of the research and the hypothesis under consideration are presented in Section 3. The description of the experiment is presented in Section 4. The results are reported in Section 5, and discussed in Section 6.
The individualization method used in the present study, detailed in Section 4.4, is based on subjective HRTF selection. The concept of these methods is to expose users to stimuli spatialized with various HRTFs and to have them rate which one best renders the expected auditory source positions [6]. The principal alternative is objective selection, often taking the form of a localization test [5], where the best HRTF is the one that maximizes participant localization accuracy.
The premise of the present study is that HRTF selection can be reduced to its essentials: that given the variance on best-match selection [17,18], and its limited impact on immediate localization performance [6,18], it is preferable to capitalize on training to improve localization accuracy. Still, previous studies have shown that using perceptually poorly-rated HRTFs led to a significant increase in adaptation time compared to best-rated ones [6,19]. The selection method introduced in Section 4.4 has been designed as a tradeoff between these two considerations: providing users with a "good enough" match HRTF in a minimum amount of time.

While localization accuracy is not the only issue resulting from non-individualized rendering [20], the current focus is on this criterion for the following literature review as well as the training program presented in Section 4.6. Readers are referred to Wright and Zhang [21] or Mendonça [22] for more general reviews on the broader topic of HRTF learning.
It has been established that one can adapt to modified HRTFs, e.g. after ear-molds are inserted in the pinnae [7,23-25], or learn to use non-individual HRTFs [3,6,19,26,27]. Studies have even shown that one can adapt to distorted HRTFs, e.g. in Majdak et al. [28] where participants suffering from hearing loss learned to use HRTFs whose spectrum had been warped to move audio cues back into frequency bands they could perceive. HRTF learning is not only possible, but lasting in time [19,26,29]: users have been shown to retain performance improvements up to 4 months after training [26]. Studies have shown that, given enough time, users using non-individual HRTFs can achieve localization performance on par with participants using their own individual HRTFs [6,19].

Impact of training protocol parameters
Learning methods explored in previous studies are often based on a localization task. This type of learning is referred to as explicit learning [22], as opposed to implicit learning where the training task does not immediately focus participant attention on localization cues [6,19]. Performance-wise, there is no evidence that suggests either type is better than the other. Implicit learning gives more leeway for task design gamification. The technique is increasingly applied to the design of HRTF learning methods [6,19,27,30], and while its impact on HRTF learning rates remains uncertain [27], its benefit for learning in general is well established [31]. Explicit learning on the other hand more readily produces training protocols where participants are consciously focusing on the learning process [32], potentially helping with the unconscious auditory mental map re-adjustment.
As much as the nature of the task, providing feedback can play an important role during learning. VR technologies are increasingly relied upon to increase feedback density in the hope of increasing HRTF learning rates. While the results of Majdak et al. [28] encourage the use of a visual virtual environment, it has been reported that proprioceptive feedback can equally be used to improve learning rates [6,33]. There is however a growing consensus on the use of adaptive (i.e. head-tracked) binaural rendering during training to improve learning rates [7,27], despite the generalized use of head-locked localization tasks to assess performance evolution [22].
Studies on the training stimulus suggest that learning can extend to more than the stimulus used during learning [27]. This result is likely dependent on stimuli relative "quality" regarding auditory localization, i.e. whether they present the transient energy and the broad frequency content necessary for auditory space discrimination [34,35].
There is no clear-cut result on optimum training session duration and spread. Training session duration reported in previous studies ranges from ≈8 min [36] to ≈2 h [28]. Comparative analysis argues in favor of several short training sessions over long ones [22]. Training session spread is also widely distributed in the literature, ranging from all sessions in one day [35] to one every week or every other week [19]. Where the results of Kumpik et al. [35] suggest that spreading training over time benefits learning (all in one day vs. spread over 7 days), direct comparison of Stitt et al. [19] and Parseihian and Katz [6] suggests that weekly sessions and daily sessions result in the same overall performance improvement (for equal total training durations). There are some examples of latent learning (improvement between sessions) in the literature [36], naturally encouraging the spread of training sessions. Regardless of duration and spread, studies have shown that learning saturation occurs after a while. In Majdak et al. [37], most of the training effect took place within the first 400 trials (≈160 min), a result comparable to that reported by Carlile et al. [38] who reached saturation after 7-9 blocks of 36 trials each.
One of the critical questions not fully answered to date is the role of the HRTF in the training process. It would appear that a certain degree of affinity between a participant and the training HRTF facilitates learning [6,19]. The question remains whether this affinity also furthers the saturation point.

Impact of room response on localization accuracy
Previous studies have examined the impact of various degrees of acoustic conditions on binaural cues, such as ground reflections for small animals [39], or subjective perception of an auditory scene [40]. There are surprisingly few studies which have examined the impact of room acoustics on localization in particular, with tested conditions often being quite limited, and results varying between studies. The room acoustic response can be summarized by the term reverberation, used here in its broadest sense, pertaining to single or multiple early reflections, late reverberation, or combinations of both.
In an early study, Hartmann [41] examined the effect of Reverberation Time (RT) and ceiling height on localization of distant real speakers (12 m) in the frontal-horizontal plane only using a 500 Hz tone burst. Using the ESPRO variable acoustics facility at IRCAM, RT of 1 s and 5.5 s were compared, as well as a lowered ceiling condition with an RT of 2.8 s, the latter changing both the geometry and RT. Results showed no effect between RT conditions, while the changes resulting from the lowered ceiling condition improved azimuthal localization performance. In contrast, Rakerd and Hartmann [42] tested specific single reflection directions compared to anechoic conditions, showing that a lateral reflection decreased azimuthal localization performance. Subsequently, Guski [43] also focused on single reflections in a dry room (RT of ≈0.5 s), and showed that the impact of a lateral reflection on localization performance was not systematic, depending on reflection position relative to the source, while a floor reflection improved localization. Results also suggested that the presence of a floor reflection benefited elevation localization performance where a ceiling reflection (1 m above listener) impeded overall localization accuracy, contrary to Hartmann [41]. Using a spherical array of speakers for individual reflection directions and a horizontal array for late reverberation (reverberator units with broadband RT of 0.4 s), Bech [44] examined the level and decay thresholds of reflections inducing coloration or image shift of a target sound source. Results were analyzed and discussed relative to previous studies, including those on the precedence effect.
Taking advantage of virtual reality audio binaural synthesis, Begault [45] studied the addition of two spaced floor reflections, 64 early reflections calculated from a 2D raytracing simulation of a reverse-fan shaped room, and a diffuse field late reverberation, using the same HRTF across subjects. Tested speech stimuli on the horizontal plane showed no effect of these additions on azimuthal performance, while a general vertical bias was observed across subjects in the reverberant condition. While no explanation was proposed for said bias, one can hypothesize that the unrealistic reflection conditions (both in 2D and the limited room modeling of the time) could induce perceptual errors. Testing localization ability over repetitions, Shinn-Cunningham [46] employed noise bursts at a distance of 1 m in a real room (RT of 0.5 s). Results showed improvement in localization performance with exposure through test repetitions (contrary to results of Hartmann [41]) for seven subjects. While mean elevation performance was judged slightly poorer than other published studies in anechoic conditions, cross-comparisons of such studies with so few subjects are difficult to interpret.
Further exploring binaural synthesis conditions, a pair of studies tested horizontal source positions using a 3D raytracing room model (dynamic reaction to head-rotations) combined with a static late reverberation with a RT of 1.5 s for individual and generic HRTFs [47,48]. Results showed reduced azimuthal error in the reverberant condition along with a vertical bias shift in localization for speech stimuli, presented solely in the horizontal plane.
Similar to the works mentioned above, Angel et al. [49] examined the effect of adding a single floor reflection in the context of binaural synthesis. Localization of a 1 s amplitude modulated noise burst was evaluated using individual HRTFs and a diffuse late field reverberation (RT of 0.5 and 1 s) for sources either on the horizontal plane or along the 45° cone of confusion.¹ Improvements were observed regarding front-back confusions on the horizontal plane for 6-out-of-9 subjects, as well as a reduction in azimuthal errors. No effect was observed on elevation errors for the cone of confusion condition, or between late reverberation time conditions. The inclusion of head-tracking, providing dynamic rendering of acoustic cues with changes in head orientation, was determined to be more beneficial than the inclusion of the floor reflection.
More recently, Nykänen et al. [50] examined the impact of reverberation using recordings made via dummy-head in a real classroom with speech and noise burst stimuli on the horizontal plane only. Results showed that front-back confusion rates were lower in the reverberant conditions compared to anechoic conditions.
Taken together, the results of these previous studies are rather mixed, if not contradictory. However, it would appear that early arriving reflections with the same azimuth (floor/ceiling) as the direct sound tend to improve lateral localization performance. Results are inconclusive with regard to improvements in elevation localization performance. Tested source distances ranged from 1 m to 12 m, with source positions being almost exclusively limited to the horizontal plane and at a limited number of positions. Test stimuli comprised speech, tone bursts, or noise bursts of various durations. In virtual conditions, HRTFs ranged from a single "generic" HRTF for all subjects to individual measured HRTFs with no observed effects.

Aim and scope of the research
The current study investigates the efficiency of a novel (1) HRTF selection and (2) training program, as well as (3) the impact of a virtual room acoustic simulation on HRTF learning rate. All three components have been designed as part of a solution to achieve non-individual HRTF accommodation and subsequent audio localization proficiency in a minimum amount of time and under unsupervised conditions.
¹ A cone-of-confusion is defined by the locus of positions having the same inter-aural time difference, thereby limiting localization cues to spectral changes only [51]. Typical localization errors are front/back or up/down confusions, where the perceived source position is mirrored across symmetry planes. Left/right confusions are non-existent in listeners without substantial hearing loss in one ear.
The HRTF selection method was designed in the hope of providing participants with a good enough perceptual fit to accelerate the training process [6] while requiring as little of their time as possible. The learning program was designed to be efficient and entertaining, informing participants on the problems of non-individual binaural rendering and providing exercises to overcome them. Finally, the room acoustic response simulation combined with the anechoic rendering was proposed under the hypothesis that multiplying position-dependent spatial cues would further participants' ability to adapt to a new HRTF. The simulated acoustic space was not associated with the visual scene presented, to avoid multimodal interactions. A neutral virtual visual scene was presented during the experiment, as the intent was to examine the impact of reverberation as a scene-agnostic localization enhancer. The term self-coherent room response is hereafter used to refer to this paradigm.
To characterize the efficiency of the HRTF selection and training program, the experimental protocol has been conceived to enable comparative analysis with previous studies [6,19]. Learning is assessed based on localization performance for sources on the whole sphere, not constrained to frontal directions or horizontal plane, using an ego-centered response interface in VR. To characterize the impact of the room acoustic simulation on HRTF adaptation, a control group was constituted with participants training under anechoic condition.
The hypotheses to be tested are as follows:
H1 Active involvement of participants during the HRTF selection procedure reduces initial (a) localization errors and (b) confusion rates.
H2 A self-coherent room response reduces (a) localization errors and (b) confusion rates prior to training.
H3 Initial participant performance predicts degree of improvement.
H4 A self-coherent room response improves HRTF learning rate evaluated via (a) localization errors and (b) confusion rates.
H5 The proposed HRTF selection method is at least as good (regarding localization accuracy with the selected HRTF) as those proposed previously in the literature.
H6 The proposed training program improves learning rates compared to previously proposed methods.

General experiment description
A total of 24 adults participated in the experiment (age 19-64 years, mean 29 ± 2 years, 7 women); none self-reported any hearing deficit. Participants first selected their best-match HRTF from a pre-existing set, based on the selection process described in Section 4.4. They then proceeded to a first localization task, described in Section 4.5.
The complete experiment sequence is illustrated in Figure 1. All participants performed the first localization task L0.1 in the anechoic condition. They were then evenly distributed into two groups: G-anech or G-reverb, performing the second localization task L0.2 in either anechoic or reverberant conditions. Group assignment was designed so that initial performance of both groups (L0.1) was evenly balanced regarding analysis metrics described in Section 5.1. Participants then underwent three training sessions, each separated from the next by 1-3 nights, each followed by a localization task for performance assessment. Localization tasks lasted ≈5 min, training sessions lasted exactly 12 min, as per [6,19]. To serve as incentive, the experiment was staged as a contest where for each group, participants ranking first and second would receive €250 and €150, respectively. A video illustrating the experiment is available online.²

Test interface and binaural rendering
The experiment was conducted in an acoustically damped and isolated room (ambient noise level < 30 dBA). Participants were equipped with a tracked head-mounted display (Oculus CV1), open circumaural reference headphones (Sennheiser HD 600), and a pair of hand tracked controllers (Oculus Touch). This setup provided tracking information for both head and hands as well as presentation of the test interface to the participant via the visual display. Tracking latency (<6 ms) and precision (≈1 cm) were sufficient for use in this study [52,53]. The virtual scene and user interface were designed and run in Unity v2017.3, rendered at a frame-rate of ≈90 fps. The entire experiment was conducted on a PC running 64-bit Windows 10 on a 3.6 GHz Intel Core i7-7700 CPU with 64 GB of RAM coupled to an NVIDIA GeForce GTX 1080Ti graphics card. After setup, the entire experiment was user guided with the test administrator only present in case of difficulties.
Anechoic binaural rendering was performed using the Anaglyph binaural audio plugin v0.9.3b, embedded in a Cycling '74 Max v8.0.5 patch. Anaglyph uses HRTF convolution with a variable delay allowing for customization of Interaural Time Differences (ITD) through a personalizable morphological model. For improved suitability for sources in the near-field, frequency-dependent Interaural Level Difference (ILD) correction is applied using a spherical head shadowing model in addition to HRTF parallax correction, providing the correct measured HRTF position filter for each ear independently. These corrections are only meaningful for sources at a radius of under 1.95 m (distance for which the HRTFs used in the present study were measured), i.e. they only apply to the hand-held probe used during the training task (see Sect. 4.6). Full details of the functionality of the binaural engine are available in Poirier-Quinot and Katz [54]. Audio-visual latency was below the ≈15 ms threshold of detectability [55].

Simulated room reverberation
To provide a self-coherent room reverberation condition, a Geometrical Acoustics (GA) room simulation was used to generate physically realistic Ambisonic Room Impulse Responses (RIRs), using CATT-Acoustic v.9.0.c:3/TUCT v1.1a:4. This GA software has been previously shown to be capable of generating comparable spatial RIRs when subjectively compared to measured data following a variety of perceptual attributes [56].
The reverberation condition employed a convolution with a second-order Ambisonic RIR with the direct sound contribution removed, as the direct sound was rendered separately as per the anechoic binaural condition. RIRs were simulated for 20 source positions, uniformly distributed on a 1.95 m radius sphere around the receiver. Second-order Ambisonic as well as the RIR grid density were adopted as a trade-off between spatial precision and processing power requirement, drawing on results from previous studies [57-60]. The room was based on a 5.7 × 5.7 × 5.4 m³ cube, with shaping to avoid flutter effects, and a slight incline of the ceiling angle, illustrated in Figure 2. Ambisonic RIRs were simulated using the following settings in the GA software: Algorithm 1: Short calculation, basic auralization and max split order 1 with 150 000 rays. Resulting RIRs had an RT (T30) of ≈0.15 s across all frequency bands (see Tab. 1). An example of the rendered RIR is shown in Figure 3, highlighting the density and decay rate of the room response, further emphasizing the continuity of reflections from the early to the latter part of the response when a realistic condition is considered. A spectrum highlighting the temporal-frequency characteristics of the RIR is also provided, highlighting the general uniformity and lack of a pronounced coloration effect.
The use of a relatively short reverberation time was deemed an obvious requirement for inclusion in a generalized VR environment. If the reverberation was too pronounced, then it would contribute a noticeable reverberant effect to whatever VR scenario would be created by VR designers, and would be accumulated with any reverberator audio production effects associated to the actual VR scene. As such, this study examines the use of what could be termed a subtle room effect, so as not to overshadow, interact with, or more importantly detract from the acoustic percepts of a VR scenario environment.
For a given position to spatialize during the localization task, an RIR was constructed on the fly in the rendering engine from a linear interpolation between the three nearest simulated RIR positions before convolution with the monophonic input stimulus signal. The resulting Ambisonic stream was then decoded using the virtual speaker approach [61] with 12 virtual speakers uniformly distributed on a sphere. The binaural encoding of each virtual speaker signal was also performed within the Anaglyph plugin, using the same ITD-corrected HRTF as for encoding the direct/anechoic path. The final signal presented during the reverberation condition was the sum of both direct and reverb streams.
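The on-the-fly RIR construction described above can be illustrated with a minimal sketch. The text does not specify how the interpolation weights were computed, so normalized inverse-angular-distance weights are used here as a stand-in, and each RIR is reduced to a single channel (a plain list of samples) rather than a second-order Ambisonic response. The function names `angular_distance` and `interpolate_rir` are hypothetical and do not come from the rendering engine used in the study.

```python
import math

def angular_distance(u, v):
    """Angle in radians between two non-zero 3D direction vectors."""
    nu = math.sqrt(sum(c * c for c in u))
    nv = math.sqrt(sum(c * c for c in v))
    dot = sum(a * b for a, b in zip(u, v)) / (nu * nv)
    return math.acos(max(-1.0, min(1.0, dot)))

def interpolate_rir(target_dir, grid_dirs, grid_rirs, k=3):
    """Blend the k nearest simulated RIRs for an arbitrary target direction.

    ASSUMPTION: weights are normalized inverse angular distances; the
    weighting actually used in the study is not stated in the text.
    """
    nearest = sorted(
        (angular_distance(target_dir, d), i) for i, d in enumerate(grid_dirs)
    )[:k]
    # Target coincides with a simulated position: return that RIR unchanged.
    if nearest[0][0] < 1e-9:
        return list(grid_rirs[nearest[0][1]])
    weights = [1.0 / dist for dist, _ in nearest]
    total = sum(weights)
    blended = [0.0] * len(grid_rirs[0])
    for (_, idx), w in zip(nearest, weights):
        for s, sample in enumerate(grid_rirs[idx]):
            blended[s] += (w / total) * sample
    return blended

# Toy usage: three simulated positions on the axes, two-sample "RIRs".
dirs = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
rirs = [[3.0, 0.0], [0.0, 3.0], [3.0, 3.0]]
mixed = interpolate_rir((1, 1, 1), dirs, rirs)
```

In the study, the interpolated response would then be convolved with the monophonic stimulus and decoded to 12 virtual speakers; those steps are left out of this sketch.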

HRTF selection method
Participants were situated in a virtual scene, facing a 7-element menu of available HRTFs. The presented HRTFs were a "subjective orthogonal subset" [62,63] of the LISTEN database [64]. In each hand, they held a small taser (virtual representation of hand-tracked VR controllers) that they could activate to create an audio-visual spark. After a general introduction to the shortcomings of nonindividualized binaural rendering, participants were left to explore various sound source positions around their head with the taser. Their instructions were to select the HRTF that minimized overall spatial incoherence (confusions, misalignment, etc.) based on the auditory, proprioceptive, and visual perception of the taser position in the hand. No other stimuli were present during this task.
HRTFs of the LISTEN database are composed of 187 pairs of impulse responses, measured on a 1.95 m radius sphere at 15° azimuth/elevation intervals, with a gap in the southern hemisphere below −45°. HRTF ITDs were adjusted based on participant head circumference to focus the selection on fine direction-dependent cues. The taser audio stimulus was a sequence of three bursts of white noise, 40 ms each with a 4 ms cosine-squared onset/offset ramp and a 70 ms inter-onset interval for a total duration of 180 ms. Full-spectrum burst sounds were chosen in order to favor an adaptation to the complete spectral cues of the HRTF [47].
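As an illustration of the stimulus timing described above, the following sketch builds the burst train sample by sample. The sampling rate (48 kHz), the uniform noise distribution, and the function name `noise_burst_train` are assumptions made for the example; the text only specifies the burst, ramp, and inter-onset durations.

```python
import math
import random

def noise_burst_train(fs=48000, n_bursts=3, burst_ms=40.0,
                      ramp_ms=4.0, ioi_ms=70.0, seed=0):
    """Return a list of samples: white-noise bursts with cosine-squared ramps.

    ASSUMPTIONS: fs = 48 kHz and uniform white noise in [-1, 1]; neither is
    specified in the text.
    """
    rng = random.Random(seed)
    burst_len = round(fs * burst_ms / 1000)
    ramp_len = round(fs * ramp_ms / 1000)
    ioi_len = round(fs * ioi_ms / 1000)
    # Total duration = (n_bursts - 1) inter-onset intervals + one burst.
    total_len = (n_bursts - 1) * ioi_len + burst_len
    signal = [0.0] * total_len
    for b in range(n_bursts):
        start = b * ioi_len
        for i in range(burst_len):
            # Distance in samples to the nearest edge of the current burst.
            edge = min(i, burst_len - 1 - i)
            if edge < ramp_len:
                # Raised-cosine (cosine-squared) ramp between 0 and 1.
                gain = math.sin(0.5 * math.pi * edge / ramp_len) ** 2
            else:
                gain = 1.0
            signal[start + i] = gain * rng.uniform(-1.0, 1.0)
    return signal

stimulus = noise_burst_train()
duration_ms = 1000 * len(stimulus) / 48000
```

With these values the duration works out to 2 × 70 ms + 40 ms = 180 ms, which matches the total stated in the text.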

Localization task
Participants were situated in a virtual scene, facing a visual anchor, surrounded by a 1.95 m radius semi-transparent sphere. The sphere was lightly textured to facilitate spatial memory and provide an adequate frame of reference during the localization task [37]. Fixing one's head position and looking at the visual anchor triggered a circular progress bar, loading for ≈1 s before triggering an audio stimulus. The stimulus was the same noise burst as in Section 4.4. Paired with the visual anchor mechanism of fixated gaze/head-orientation, the short audio stimulus prevented participants from initiating head movement during presentation of the target stimulus [65]. The stimulus was randomly spatialized at one of 20 potential positions, evenly distributed on a sphere. Each position was repeated three times for a total of 60 localization trials. Participants used a pair of handheld blasters to indicate the perceived direction of origin of the audio source on the surrounding sphere, inspired by the "manual pointing" reporting method assessed in Bahu et al. [66]. When activated (trigger button) a blaster would shoot a small red ball onto the semi-transparent sphere that would correspond to the participant's perceived direction of origin. To facilitate aiming, each blaster was equipped with a laser sight. When evaluated on visible targets during debug sessions, the overall setup resulted in pointing errors below 2°.
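The trial structure above (20 positions, three repetitions each, random order) can be sketched as follows. The text does not state which 20-point layout was used; the vertices of a regular dodecahedron are taken here purely as one example of 20 evenly distributed directions. The function names `dodecahedron_directions` and `build_trials` are hypothetical.

```python
import math
import random

def dodecahedron_directions(radius=1.95):
    """Return the 20 vertices of a regular dodecahedron scaled to `radius`.

    ASSUMPTION: the actual target grid of the study is not specified; this
    layout is only one way to obtain 20 evenly spread positions.
    """
    phi = (1 + math.sqrt(5)) / 2
    inv = 1 / phi
    pts = [(x, y, z) for x in (-1, 1) for y in (-1, 1) for z in (-1, 1)]
    for a in (-inv, inv):
        for b in (-phi, phi):
            pts += [(0, a, b), (a, b, 0), (b, 0, a)]
    scale = radius / math.sqrt(3)  # every vertex above has norm sqrt(3)
    return [(x * scale, y * scale, z * scale) for x, y, z in pts]

def build_trials(n_positions, repetitions=3, seed=None):
    """Return a shuffled list of position indices, each repeated equally."""
    trials = [i for i in range(n_positions) for _ in range(repetitions)]
    random.Random(seed).shuffle(trials)
    return trials

positions = dodecahedron_directions()
trials = build_trials(len(positions), repetitions=3, seed=1)
```

Shuffling the full list, rather than drawing positions independently at each trial, keeps the three repetitions per position exact.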

Training task
The gamified training task resembled a hide-and-seek game, where participants had to identify a hidden audio source among visual potential targets, designate it, and repeat, with different degrees of complexity during the course of training. Upon startup, participants were presented with a selection menu from which they could enter any of the already unlocked training scenarios. A training scenario was composed of a predefined number of trials, each trial starting with the creation of at least two visual potential targets (white spheres in Fig. 4), and ending with participants designating one of them as the hidden audio source or active target. After completing all the trials of a given scenario, participants were returned to the scenario selection menu. Training automatically ended after the allotted 12 min. The training VR scene situated the participant on top of a 2 m radius platform, surrounded by a 360° sky dome, providing a frame of reference as for the sphere texture of the localization task.
During a trial, participants were free to look around to identify all the visual potential target positions of that trial. The active target would remain silent except when they looked directly at the visual anchor (green sphere in Fig. 4) and stood at the center of the platform, thus preventing the use of head-movements for localization. In their hands, participants held the same pair of tasers as in Section 4.4. The primary use of the tasers was to serve as spatial audio probes: when hesitating between two or more visual target positions, participants could move a taser (i.e. a hand) towards each of them and trigger it at will to compare its sound to that of the hidden audio source. Both taser and hidden audio source used the same noise burst as in Section 4.4. This comparison mechanism was deemed an essential component of the training, leading participants to carefully listen to audio spectral cues, reinforcing sound-to-position relationships via proprioception and visual feedback. The taser also served as a selection pointer, with which participants could designate which visual target they thought was the active one. Upon visual target selection, the true active target emitted a sound, indicating if the choice was correct as well as revealing its position.
A total of 14 training scenarios were designed, divided into four "difficulty levels", representing increasingly complex scenarios. Scenarios from the first difficulty level each focused on introducing one of the specific known issues of non-individualized binaural rendering (front-back confusions, cone of confusion, angular resolution, etc.). Subsequent scenarios re-exposed these problems, further increasing in difficulty. A complete list of all training scenarios as well as specifics on level design mechanics are provided in the Appendix.

Results
Section 5.1 presents the analysis tools and metrics employed in quantifying the results. Section 5.2 examines the degree of participant active involvement in the HRTF selection task. Section 5.3 asserts initial group performance equivalence during the first anechoic localization task L0.1. The impact of reverberation on localization accuracy prior to learning (L0.1 vs. L0.2) is assessed in Section 5.4. Participants' initial performances are correlated to their relative improvement in Section 5.5. The impact of reverberation on learning rate (L1 to L3 progression) is assessed in Section 5.6. The impact of the novel proposed learning paradigm, comparing the results of G-anech with those of previous studies, is assessed in Section 5.7.

Performance metrics
Analysis of localization performance was conducted using the interaural polar coordinate system [67], allowing for a rough separation of the role of different interaural cues throughout the analysis [19]. Two types of metrics were used during the analysis: angular and confusion metrics, all computed by comparing participant responses against target true position.
The four angular metrics considered were the interaural lateral and polar errors, the great-circle error, and the folded interaural polar error [19]. Lateral (resp. polar) error is the absolute difference between response and true target lateral (resp. polar) angle, wrapped to [−180°, 180°]. Great-circle error is the minimum arc, expressed in degrees, between response and true target position. Folded polar error is computed after compensation for any observed front-back or up-down confusions, applying a front-back (resp. up-down) axis-symmetry to the response position before calculating the polar error. Lateral and polar errors analyze specific aspects of the responses, highlighting what types of errors are occurring, while the basic global great-circle error indicates the overall magnitude of errors. It is understood that some metrics are derivatives of others, or not totally independent. Polar angle confusions were classified using a traditional segmentation of the cone of confusion (see [6,19], revised in [5]). The classification results in three potential confusion types: front-back, up-down, and combined, with a fourth type corresponding to precision errors, represented schematically in Figure 5. The precision category designates any response close enough to the real target so as not to be associated with the other confusion types.
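The lateral, polar, and great-circle errors defined above can be sketched as follows. The axis convention (x pointing front, y along the interaural axis towards the left, z up) and all function names are assumptions made for the example. The folded polar error and the confusion classification are omitted, as they depend on the cone-of-confusion segmentation defined in the cited references.

```python
import math

def _unit(v):
    """Normalize a non-zero 3D vector."""
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def interaural_polar(v):
    """Return (lateral, polar) angles in degrees.

    ASSUMED axes: x = front, y = left (interaural axis), z = up.
    Lateral is the angle from the median plane, in [-90, 90]; polar is the
    angle around the interaural axis, 0 in front and 90 above.
    """
    x, y, z = _unit(v)
    lateral = math.degrees(math.asin(max(-1.0, min(1.0, y))))
    polar = math.degrees(math.atan2(z, x))
    return lateral, polar

def wrap180(angle):
    """Wrap an angle in degrees to the interval [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0

def lateral_error(response, target):
    return abs(interaural_polar(response)[0] - interaural_polar(target)[0])

def polar_error(response, target):
    diff = interaural_polar(response)[1] - interaural_polar(target)[1]
    return abs(wrap180(diff))

def great_circle_error(response, target):
    """Minimum arc, in degrees, between the two directions."""
    r, t = _unit(response), _unit(target)
    dot = sum(a * b for a, b in zip(r, t))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot))))

# A pure front-back confusion: target in front, response directly behind.
front, back = (1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)
errors = (lateral_error(back, front),
          polar_error(back, front),
          great_circle_error(back, front))
```

In this front-back example the lateral error is zero while the polar and great-circle errors are both 180°, which illustrates why the lateral and polar components are reported separately from the global error.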
The impact of independent variables (reverberation condition, session, etc.) on performance metrics was assessed using a Friedman test, as most distributions proved to follow non-normal (skewed) distributions.
Post-hoc paired-sample comparison was based on a Wilcoxon signed rank test for angle error distributions, and a chi-squared test for confusion rate distributions. Statistical significance was determined for p-values below the 0.05 threshold; the notation p < 1 is adopted to indicate p-values below 10⁻³. Reported p-values are those of post-hoc tests. Effect size d is reported for pairs of distributions with p-values below 0.05, using Cohen and Phi statistics for angle error distributions and confusion rate distributions respectively. The notation d < 2 is adopted to indicate effect sizes below 0.1 (i.e. when two group means differ by < 0.1 standard deviation). Significant differences between fits pertaining to the analysis of learning rates in Section 5.7.2 were assessed based on comparisons of their coefficients. Significant difference is discussed when at least one of the coefficients of two fits differ beyond 50% of their estimate's 95% Confidence Interval (CI) [68]. The symbol "±" represents standard error throughout the paper.
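As an illustration of the two effect size statistics named above, the following sketch computes Cohen's d and the Phi coefficient. The exact variants used in the study are not specified in the text: the pooled-standard-deviation form of Cohen's d and the 2 × 2 contingency form of Phi are assumptions, as are the function names.

```python
import math
from statistics import mean, variance

def cohen_d(a, b):
    """Cohen's d using the pooled sample standard deviation (assumed variant)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)

def phi_2x2(table):
    """Phi coefficient of a 2x2 contingency table [[a, b], [c, d]].

    Its magnitude equals sqrt(chi2 / n) for the uncorrected chi-squared statistic.
    Rows could be two sessions or groups, columns 'confusion' vs. 'no confusion'.
    """
    (a, b), (c, d) = table
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom
```

With the illustrative samples [2, 4, 6] and [1, 3, 5], the means differ by 1 and the pooled standard deviation is 2, giving d = 0.5.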

Effect of HRTF selection task behavior on initial performance (H1)
Participant initial performances were compared to their behavior during the HRTF selection task, to assess whether the degree of active involvement in the task translates to improved initial localization performance. Participant involvement was quantified by examining several "behavioral" metrics: number of times the acoustic probe was activated, portion of the sphere explored, number of times the active HRTF was changed, and the duration of the selection task. The metric "sphere coverage" is a percentage, computed based on how many "regions" of the sphere were probed, where regions are defined as Voronoi cells around the 20 potential target positions of the localization task. Correlation coefficients between these metrics and initial L0.1 performances are reported in Table 2.
Results show the magnitude of correlations never exceeding 0.30, indicating weak to no correlation. These results do not support H1.
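The "sphere coverage" metric described above can be sketched as follows. On a sphere, the Voronoi cell containing a direction is that of the nearest target in great-circle distance, i.e. the target with the largest dot product. The azimuth/elevation parameterization, the function names, and the four-target grid in the usage note are illustrative assumptions; the study used 20 target positions that are not listed in this section.

```python
import math

def unit(azimuth_deg, elevation_deg):
    """Unit vector for a direction given as azimuth/elevation in degrees."""
    az, el = math.radians(azimuth_deg), math.radians(elevation_deg)
    return (math.cos(el) * math.cos(az), math.cos(el) * math.sin(az), math.sin(el))

def nearest_target(direction, targets):
    """Index of the target whose Voronoi cell contains `direction`."""
    return max(
        range(len(targets)),
        key=lambda i: sum(p * q for p, q in zip(direction, targets[i])),
    )

def sphere_coverage(probe_directions, targets):
    """Percentage of target Voronoi cells probed at least once."""
    visited = {nearest_target(d, targets) for d in probe_directions}
    return 100.0 * len(visited) / len(targets)
```

For instance, with four hypothetical targets on the horizontal plane at azimuths 0°, 90°, 180° and 270°, probing near 10°, 80° and −20° visits two of the four cells, giving a coverage of 50%.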

Group baseline: initial performance comparison
Following the initial localization test results in L0.1, participants were divided into 2 groups, evenly matching great-circle and front-back confusion errors. There remained however a slight mismatch between group means at the start for the metrics lateral and up-down confusion errors. This small yet significant offset was in favor of G-reverb (difference of ≈3° and 3% resp.). Detailed group baseline performance during L0.1 is reported in Table 3 and shown in Figure 6.

Instantaneous impact of room response on performance (H2)
Figure 6 illustrates the evolution of group performance from L0.1 to L0.2, where G-reverb started using the reverberation while G-anech remained in anechoic conditions, prior to any training. A slight overall improvement trend can be observed, likely due to procedural training. However, neither group showed any significant improvement on angular metrics between the two sessions. Baseline group mean differences observed in L0.1 on lateral errors and up-down confusions were also present in L0.2.
G-anech confusion rates did not significantly improve between the two sessions. In contrast, G-reverb combined confusion rate improved from 21.5% to 17.4% (p = 0.046). Consequently, G-reverb combined confusion rate was significantly below the 22.6% rate of G-anech in L0.2 (p = 0.012). No other differences were observed for G-reverb confusion rates between sessions L0.1 and L0.2.
Further L0.2 comparisons revealed that G-reverb front-back confusion rates were significantly higher than those of G-anech (21.0% vs. 14.2% resp., p < 1 ). In the absence of intra-group performance evolution between these two sessions, this result only hints at reverberation having a negative impact on immediate front-back confusions. This observation could also simply be the result of the confusion classification method "redistributing" G-reverb errors from combined and up-down confusions (decreasing) to front-back confusions (increasing) between the two sessions.

Predicting relative improvement based on initial performance (H3)
Participant relative improvement between L0.2 and L3 as a function of initial L0.2 performance is reported in Figure 7. Both data sets are strongly correlated for lateral angle, front-back, and up-down confusion errors (|r| > 0.5), arguing in favor of H3. Additionally, scattered values show the overall even distribution of participants' improvement during training. Negative improvements on front-back and up-down confusions, i.e. points with y-axis values above zero, are a result of the confusion classification method redistributing errors between categories, as the sum of the 4 categories always tabulates to 100%. Participants did improve overall with regards to confusions after training, as illustrated by the increase in precision confusion scores in Figure 7f.
Table 2. Correlation coefficients between participant behavior during the HRTF selection task and initial localization performance. Coefficients whose magnitudes are <0.15 …
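The correlation analysis above can be sketched as a Pearson coefficient between initial error and its change over training. The definition of the change as final minus initial error (so that negative values indicate improvement, consistent with values above zero denoting a degradation) is an assumption, as are the function names and the sample values.

```python
import math

def error_change(initial, final):
    """Per-participant change in error: final minus initial (negative = improved)."""
    return [f - i for i, f in zip(initial, final)]

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Illustrative values only: participants with larger initial errors improve more.
initial_errors = [10.0, 20.0, 30.0, 40.0]
final_errors = [8.0, 14.0, 20.0, 26.0]
r = pearson_r(initial_errors, error_change(initial_errors, final_errors))
```

A strongly negative r under this sign convention corresponds to the pattern supporting H3: the worse the initial performance, the larger the reduction in error.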

Impact of room response on performance evolution (H4)
Group performance evolution from Day 1 to Day 3 on both types of metrics is reported in Figures 8 and 9. For both groups, a steady and significant decrease can be observed on all angle errors throughout training. Although less clear-cut, the overall improvement on confusion errors is still readily apparent.
For all but front-back confusions, G-reverb consistently outperformed G-anech from L1 onwards. Setting aside those two metrics for which initial group performance was not evenly balanced (lateral errors and up-down confusions), results strongly suggest that the inclusion of the room response had a positive impact on performance evolution during training.

Impact of HRTF selection method and training program: comparison to previous studies (H5 & H6)
Results of the current study are compared to those of three previous experiments concerned with HRTF selection and HRTF training in anechoic conditions. The following brief descriptions of each of those experiments highlight the elements relevant to the current comparison (while not covering the entirety of those studies).
Exp-Parseihian [6]. Study on HRTF accommodation in which test group G3, composed of 5 participants, trained with their best-match HRTF, selected from the same subset as the one used in the present study. The selection was based on participant ratings of the precision of audio trajectories created with each HRTF, referred to as the trajectory method. Training consisted of a 12 min game, inducing HRTF adaptation by coupling proprioceptive feedback and passive listening. Designed to be compatible with studies with visually impaired individuals, no visual interface or feedback was provided. The training schedule comprised one training session per day for three consecutive days. Each training session was followed by a performance evaluation based on a localization task. G3 participant performance evolution was compared to that of G1, a control group consisting of 5 participants, also using their best-match HRTF but undertaking only the first training session.
Exp-stitt [19]. Study on HRTF accommodation in which test groups W4 and W10, composed of a total of 16 participants, trained with their worst-match HRTF …
Exp-zagala [5]. Study on HRTF quality evaluation in which 28 participants' best-match HRTFs were determined based on two different methods. The first method was the trajectory method, described above. The second relied on the results of a basic localization task, similar to that described in Section 4.5, referred to as localization method. Seven of the 8 HRTFs evaluated were from the same subset as that used in the present study. Each task was repeated three times to examine repeatability.
Table 3: … to 54.0% and 57.9% for groups G-anech and G-reverb resp.

Impact of the HRTF selection method on initial performances (H5)
The average duration of the HRTF selection stage for each experiment is reported in Table 4. The selection method used in the present study, referred to as the active method, required on the order of half the time required by the other three (based on the duration reported in exp-zagala). It must be noted that this method was designed to select only a unique best-match HRTF, while the other methods yield a full ranked list of the presented HRTFs. Figure 10 illustrates the localization performance of participants immediately after the HRTF selection in all three experiments. Participants using the proposed active method significantly outperformed those using the trajectory method, except for the front-back error rate of group zagala traj. As detailed in Figure 10, results related to the group zagala traj are presented on a per-metric best-case outcome of the two tested methods. Participants using the proposed active method nevertheless significantly outperformed previously cited studies on initial great-circle, up-down confusion, and combined confusion errors.
Table 4. Mean duration of the HRTF selection sequence reported in each experiment. HRTF selection durations as reported in similar studies have been added for reference. Methods marked 1 2 3 provide a full ranking of an HRTF set (e.g. trajectory method), the remainder are designed for the selection of a unique best-match HRTF (e.g. active method). Indicated durations have been divided by the number of task repetitions compared to total reported duration in each experiment (e.g. 3 in exp-zagala) for accurate comparison.

Impact of the training protocol on performance evolution (H6)
The evolution of exp-Parseihian and exp-current participant performance across training sessions is illustrated in Figures 11 and 12. The linear regression fit applied on participant results is of the form ax + b, as per Majdak et al. [28]. The low goodness of fit values reported on both figures are a direct result of large inter- and intra-participant variance. Regression coefficients are still judged meaningful as long as session ID proved to have a significant impact on the metric under consideration. Regression-related elements are omitted from Figures 11 and 12 when this condition is not met.
Results on initial performance coefficient b reflect those reported in Section 5.7.1: exp-current participants started training with lower great-circle and lateral angle errors than those of exp-Parseihian. A direct comparison of the improvement rate coefficient, a, suggests that there is no significant difference between training methods, except for lateral error evolution. Comparison of 95% CI overlap of coefficient a with zero, however, reveals that exp-current participants alone improved on polar folded error and combined confusion rates. A similar comparison suggests that exp-Parseihian participants significantly improved on front-back confusions where exp-current did not, a result mitigated by unbalanced participants' initial performance as discussed in Section 6.
Pairwise post-hoc comparison p-values are shown on Figures 11 and 12 for those distributions showing significant session × experiment interactions. Having a slightly better initial performance and improving at the same rate, exp-current participants outperformed those of exp-Parseihian on great-circle and lateral error for all sessions. Having comparable initial performance and improving faster, exp-current participants outperformed those of exp-Parseihian on polar folded confusion rates from session L2 onwards.
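The ax + b fit and the comparison of the 95% CI of coefficient a with zero can be sketched with an ordinary least-squares fit. This is a minimal illustration, not the authors' analysis code: the CI uses a normal (large-sample) approximation rather than a Student t quantile, which is an assumption, and the function names and sample values are hypothetical.

```python
import math
from statistics import NormalDist

def linear_fit(x, y, confidence=0.95):
    """Least-squares fit y = a*x + b with approximate CIs on a and b."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    a = sxy / sxx
    b = my - a * mx
    # Residual variance with n - 2 degrees of freedom.
    sse = sum((yi - (a * xi + b)) ** 2 for xi, yi in zip(x, y))
    s2 = sse / (n - 2)
    # Assumption: normal quantile instead of Student t (adequate for large n only).
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    half_a = z * math.sqrt(s2 / sxx)
    half_b = z * math.sqrt(s2 * (1.0 / n + mx * mx / sxx))
    return {"a": a, "b": b, "ci_a": (a - half_a, a + half_a), "ci_b": (b - half_b, b + half_b)}

def slope_differs_from_zero(fit):
    """True when the CI of the improvement rate a does not overlap zero."""
    low, high = fit["ci_a"]
    return low > 0.0 or high < 0.0

# Illustrative values only: session index vs. error in degrees for pooled responses.
sessions = [0, 0, 1, 1, 2, 2, 3, 3]
errors = [30.0, 32.0, 27.0, 29.0, 24.0, 26.0, 21.0, 23.0]
fit = linear_fit(sessions, errors)
```

In this illustrative case the fitted slope is −3° per session with intercept 31°, and the CI of a excludes zero, which is the condition used above to state that a group improved on a given metric.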

Discussion
The comparison of participants' active involvement during the HRTF selection with their initial localization performance in Section 5.2 did not argue in favor of H1.
The comparison of G-reverb and G-anech initial performance in Section 5.4 does not argue in favor of H2. The only significant, if somewhat small, impact of reverberation on performance observed prior to training was on combined confusions, reduced to ≈17% compared to ≈21% in the initial anechoic condition.
Results presented in Section 5.5 argue in favor of H3, which can be interpreted as follows: participants who improved the most are those who started the worst, having more room for improvement. This observation was previously made in Stitt et al. [19] and Majdak et al. [28], underlying the logarithmic progression model proposed in the latter. It is important to note that no extreme cases were observed where a participant failed to improve, as reported by Stitt et al. This is most likely due to the context of that study, focusing on worst-case scenario training with a worst-match HRTF, while the current study employed a rapid best-match HRTF training protocol. These results again emphasize the importance of suitable HRTF matching even when training.
The comparison of G-reverb and G-anech performance evolution in Section 5.6 clearly supports H4. Despite the short duration of the training, some of the metrics monitored tend to plateau from L2 onwards (lateral, polar folded, and combined confusions in Figs. 8 and 9). Comparison of the plateaued values suggests that training under reverberant condition could, beyond accelerating learning rates, lead to better long-term overall performance. Two or more additional training sessions would be required to assert this extension of H4.
The comparison of performance prior to training in Section 5.7.1 argues in favor of H5. The proposed HRTF selection method performed surprisingly well, even compared to the post-hoc per-metric best-case zagala loc reference. The authors believe that beyond providing a good-match HRTF, the main advantage of the proposed method is that it intuitively confronts users with the impact of HRTF matching on spatial perception. Coupled with a selection task suited to novices and experts alike (not tedious for the former, allowing the latter precise testing), this method somewhat led participants to actively work on their learning strategy, framing the whole training as an opportunity to improve on a valuable game skill.
Finally, the inter-experiment comparison of performance evolution in Section 5.7.2 supports H6. When exp-current participants started with performance on par with those of the other experiments, they either learned at faster (polar folded errors, combined confusions) or similar rates (up-down confusions). When exp-current participants started with better performance, they learned at similar (great-circle and lateral errors) or slower rates (front-back confusion). Pending further testing, this last result, except for front-back confusions, argues in favor of H6, as it has been shown that long-term evolution followed a logarithmic progression [28], i.e. that the more room for improvement participants have, the faster they tend to learn.
According to informal interviews and observation of participants during the experiment, one of the advantages of this training program is that it coupled small and focused training sequences (scenarios) with an explicit scoring feedback. The authors believe that, with the dynamic adjustment of generated configurations targeting localization conditions where participants showed difficulty, these elements were key to the observed training efficiency.
As discussed in Section 4.3, the second-order Ambisonic reverberation and sparse 20-point RIR grid were adopted as a trade-off between spatial precision and processing power requirement. Coupled with the artifacts of linear interpolation [69], these design choices limit the scope of the results. Additional work is required to characterize the impact of these choices, including the prescribed reverberation time and room shape, to further investigate the impact the room acoustic response had on learning efficiency.

Conclusion
This paper presented the results of a perceptual study designed to assess a novel HRTF selection method and training program, conceived to reduce the time required to obtain acceptable binaural localization performance. The 24 participants of the experiment started by selecting a non-individual HRTF from an existing database, with which they trained during three 12 min sessions. The rapid selection method employed an active exploration by participants where they judged audio-visual-proprioceptive coherence as a function of HRTF. Participants were divided into two groups, training under either anechoic (control) or reverberant conditions, to assess whether additional audio spatial cues provided via a 3D room response improved learning rates.
Analysis of initial localization performance (prior to training) indicated that the 5 min active HRTF selection method led to localization performance as good as, if not better than, previously suggested methods [5]. Analysis of participant evolution under anechoic conditions indicated that the training program led to improved learning rates compared with those of previous studies [6]. Finally, comparisons between group performance showed that the proposed self-coherent, subtle scene-agnostic room acoustic response accelerated non-individual HRTF learning compared to anechoic conditions.