A Bayesian model for human directional localization of broadband static sound sources

Humans estimate sound-source directions by combining prior beliefs with sensory evidence. Prior beliefs represent statistical knowledge about the environment, and the sensory evidence consists of auditory features such as interaural disparities and monaural spectral shapes. Models of directional sound localization often impose constraints on the contribution of these features to either the horizontal or vertical dimension. Instead, we propose a Bayesian model that flexibly incorporates each feature according to its spatial precision and integrates prior beliefs in the inference process. The model estimates the direction of a single, broadband, stationary sound source presented to a static human listener in an anechoic environment. We simplified interaural features to be broadband and compared two model variants, each considering a different type of monaural spectral features: magnitude profiles and gradient profiles. Both model variants were fitted to the baseline performance of five listeners and evaluated on the effects of localizing with non-individual head-related transfer functions (HRTFs) and sounds with rippled spectra. We found that the variant equipped with spectral gradient profiles outperformed other localization models. The proposed model appears particularly useful for the evaluation of HRTFs and may serve as a basis for future extensions towards modeling dynamic listening conditions.


Introduction
When localizing a sound source, human listeners have to deal with numerous sources of uncertainty [1]. Uncertainties originate from ambiguities in the acoustic signal encoding the source position [2] as well as the limited precision of the auditory system in decoding the received acoustic information [3,4]. Bayesian inference describes a statistically optimal solution to deal with such uncertainties [5] and has been applied to model sound localization in various ways [6][7][8][9].
Typical sound-localization models rely on the evaluation of several spatial auditory features. Head-related transfer functions (HRTFs) describe the spatially dependent acoustic filtering produced by the listener's ears, head, and body [10] and have been used to derive spatial auditory features. The way to extract those features is a matter of debate. In particular, a large variety of monaural spectral-shape features have been studied [11][12][13][14][15][16][17], with spectral magnitude profiles [14,17] and spectral gradient profiles [12,15] being the most established ones. Despite this debate, there is consensus that the interaural time and level differences (ITDs and ILDs) [1] as well as some form of monaural spectral shapes are important features for the directional localization of broadband sound sources [18].
In order to decode the spatial direction from the auditory features, models rely on the assumption that listeners have learned to associate acoustic features with spatial directions [13,19]. Interaural features are particularly informative about the lateral directions (left/right) and ambiguous with respect to perpendicular directions along sagittal planes (top/down and front/back), for which the monaural spectral features are more informative [18]. This phenomenon led to the proposal of a variant of the spherical coordinate system, the so-called interaural-polar coordinate system, in which the two poles are placed on the interaural axis [20]. In this article, we use the so-called modified interaural-polar coordinate system, with the lateral angle describing a source along the lateral directions ranging from −90° (left) to 90° (right) and the polar angle describing a source along a sagittal plane in the interval ranging from −90° (bottom), via 0° (eye level, front), 90° (top), and 180° (back), to 270° [21]. Directional sound-localization studies typically use the interaural-polar coordinate system to separate the effects of the interaural and monaural features [22,23]. However, this separation is a simplification. For example, monaural spectral features can also contribute to the direction estimation along the lateral dimension [24][25][26]. Hence, directional sound-localization models may provide better predictions when jointly exploiting the information encoded by all auditory features.
Such joint information has already been considered in a model of directional sound localization based on Bayesian inference [6]. This model computes a spatial likelihood function from a precision-weighted integration of noisy acoustic features. Then, the perceived source direction is assumed to be at the maximum of that likelihood function. While this model was built to assess which spatial information can be accessible to the auditory system, its predictions overestimate the actual human performance, yielding unrealistically low front-back confusion rates and localization errors [27]. Still, to model human performance, this model can serve as a solid basis for improvements such as the consideration of monaural spectral features, the integration of response noise involved in typical localization tasks, and the incorporation of prior beliefs.
Prior beliefs are essential in the process of Bayesian inference because they reflect the listener's statistical knowledge about the environment, helping to compensate for uncertainties in the sensory evidence [28]. For example, listeners seem to effectively increase precision in a frontal localization task by assuming source directions to be more likely located at the eye-level rather than at extreme vertical positions [8]. However, such an increase in precision may come at the cost of decreasing accuracy. As it seems, the optimal accuracy-precision trade-off in directional localization depends on the statistical distribution of sound sources [29]. While listeners seem to adjust their prior beliefs to changes in the sound-source distribution [29,30], they may also establish long-term priors reflecting the distribution of sound sources in their everyday environment.
Here, we introduce a Bayesian inference model to predict the performance of a listener estimating the direction of static broadband sounds. The model implements a noisy feature extraction and probabilistically combines interaural and monaural spatial features [6]. We limit our model to an anechoic auditory scene with a single broadband and stationary sound source, without listener and source movements. In this scenario, the representation of monaural features requires accounting for spectral information [8,15], while the computation of interaural features can rely on broadband estimators [18,31]. Despite such simplifications, the model structure can easily include more complex processing of interaural features as required for narrow-band stimuli or reverberant and multisource environments (e.g. [32]). Our model computes the likelihood function by comparing the features with templates (i.e. spatial features obtained from listener-specific HRTFs). Subsequently, the probabilistic representation of the sound direction results from combining the sensory evidence with prior beliefs. For simplicity, we consider a static prior emphasizing directions at eye level [8], while the model structure can easily integrate future extensions towards more flexible, task-dependent priors. A Bayesian decision function estimates the source position from the resulting spatial representation. As a last step, the model incorporates response scattering to account for the uncertainty introduced by pointing responses in localization experiments [15].
For evaluation, we considered a model variant based on spectral amplitudes and a model variant based on spectral gradients [6]. The model's parameters were fitted to the sound-localization performance of individual listeners [23]. We then tested the simulated responses of both model variants against human responses from sound-localization experiments investigating the effects of non-individual HRTFs [22] and ripples in the source spectrum [33].
The paper is organized as follows: Section 2 describes the auditory model (Sect. 2.1) and explains the parameter estimation (Sect. 2.2). Then, Section 3 evaluates the model's performance by comparing its estimations to the actual performance of human listeners. Finally, Section 4 discusses the model's relevance as well as its limitations, and outlines its potential for future extensions.

Model description
The proposed auditory model consists of three main stages, as shown in Figure 1: 1) The feature extraction stage determines the encoded acoustic spatial information, represented as a set of spatial features altered by noise; 2) The Bayesian inference stage integrates the sensory evidence, resulting from the decoding procedure based on feature templates, with the prior belief and forms a perceptual decision; and 3) The response stage transforms the perceptual decision into a directional response by corrupting the estimation with uncertainty in the pointing action.

Feature extraction
The directional transfer function transformed into the time domain (i.e., the HRTF processed to remove the direction-independent component [34]) and convolved with the sound source provides the sensory evidence, which is represented by the spatial auditory features. We follow [6] in that we decode the spatial information provided by a single sound source via the binaural stimulus from a vector of spatial features:

x = [x_itd, x_ild, x_mon]^T,  (1)

where x_itd denotes a scalar ITD feature, x_ild a scalar ILD feature, and x_mon a vector that concatenates the monaural spectral features of the left ear, x_L,mon, and the right ear, x_R,mon. Each feature is assumed to be extracted by a different neural pathway responsible for delivering encoded spatial information to higher levels of the auditory system [1,4]. Simple broadband estimators approximated the interaural features [18,31] because we limited our evaluation to the task of localizing a broadband and spatially static sound source in an acoustic free field. More complex representations of interaural features are required when considering more natural listening conditions, e.g., reverberant spaces, multi-talker environments, sounds embedded in noise, or sounds with spectro-temporal fluctuations such as speech or music. In our simple listening scenario, the ILD was approximated as the time-averaged broadband level difference between the left and right channels [18]. The ITD was estimated by first processing each channel of the binaural signal with a low-pass Butterworth filter (10th order, cutoff 3000 Hz) and an envelope extraction step based on the Hilbert transform. Then, the ITD was computed with the interaural cross-correlation method, which is a good estimator of perceived lateralization in static scenarios with noise bursts [31]. In addition, we applied the transformation proposed by Reijniers et al. [6] to compensate for the increasing uncertainty at increasing ITDs [35], resulting in a dimensionless quantity with a more isotropic variance:

x_itd = sgn(itd) (1/b_itd) ln(1 + (b_itd/a_itd) |itd|),  (2)

with itd denoting the ITD in μs, the parameters a_itd = 32.5 μs and b_itd = 0.095, and sgn(·) indicating the sign function (for details on the derivation based on signal detection theory, see Supplementary Information of [6]). An example of the interaural features as functions of the lateral angle is shown in Figure 2.
Monaural spectral features, x_{L,R},mon, were derived from approximate neural excitation patterns. To approximate the spectral resolution of the human cochlea, we processed the binaural signal with a gammatone filterbank with non-overlapping equivalent rectangular bandwidths [36,37], resulting in N_B = 27 bands within the interval [0.7, 18] kHz [38,39]. Half-wave rectification and square-root compression, modeling hair-cell transduction (e.g., [32,40]), then resulted in the unit-less excitation:

c^u_{f,b}[n] = sqrt( max{ (g_b ∗ h^u_f)[n], 0 } ),  (3)

where the subscript f ∈ {L, R} indicates the left or right ear, n = 1, ..., N is the time index, b = 1, ..., N_B is the band index, g_b[n] is the corresponding gammatone filter, and h^u_f[n] is the binaural signal on a normalized scale for sound direction u (i.e., a pair of head-related impulse responses or their convolution with a source signal).
We thus defined the spectral feature for the magnitude profiles (MPs) as the vector x_f,MP, the collection of root-mean-square amplitudes across time, in decibels, for each spectral band:

x_f,MP[b] = 20 log10( sqrt( (1/N) Σ_{n=1}^{N} (c^u_{f,b}[n])² ) ),  (4)

where the function c^u_{f,b}[n] is defined in equation (3). Positive gradient extraction over the frequency dimension can be computed as an alternative spectral feature, since its integration increases the agreement between human localization performance and the model's predictions [15]. Therefore, we defined a second possible spectral feature based on gradient profiles (GPs) with the vector x_f,GP, which includes the gradient extraction as an additional processing step:

x_f,GP[b] = x_f,MP[b+1] − x_f,MP[b]  if x_f,MP[b+1] > x_f,MP[b],  and 0 otherwise,  (5)

for b = 1, ..., N_B − 1. A visualization of these monaural features is shown in Figure 3 (example for subject NH12 [23]).
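Given a precomputed excitation matrix (bands × time samples), the MP and GP features described above reduce to a per-band RMS in dB followed by a rectified inter-band difference. A minimal numpy sketch, with our own function names and a synthetic excitation:

```python
import numpy as np

def magnitude_profile(excitation):
    """MP feature: per-band RMS across time, in dB; excitation is (N_B, N)."""
    rms = np.sqrt(np.mean(excitation**2, axis=1))
    return 20 * np.log10(rms + 1e-12)

def gradient_profile(excitation):
    """GP feature: positive gradient across adjacent bands of the MP feature."""
    mp = magnitude_profile(excitation)
    grad = np.diff(mp)                 # N_B - 1 inter-band differences
    return np.maximum(grad, 0.0)       # keep only positive gradients

# toy excitation: 27 bands with a level rising over 14 bands, then falling
n_bands, n_samples = 27, 1000
levels = np.concatenate([np.linspace(0.1, 1.0, 14), np.linspace(1.0, 0.2, 13)])
exc = levels[:, None] * np.ones((n_bands, n_samples))
x_mp = magnitude_profile(exc)          # 27 values in dB
x_gp = gradient_profile(exc)           # 26 values; zero on the falling slope
```

Only the rising part of the spectral envelope survives in x_gp, which is the behavior that makes the GP variant insensitive to slowly varying source spectra.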
To demonstrate the impact of the monaural spectral feature type, we analyzed the results of both variants with the corresponding feature spaces defined as follows:

t_MP = [x_itd, x_ild, x_L,MP, x_R,MP]^T  and  t_GP = [x_itd, x_ild, x_L,GP, x_R,GP]^T.  (6)

Limited precision in the feature extraction process leads to corruption of the features and can be modelled as additive internal noise [6]. Hence, we defined the noisy internal representation of the target features as:

t = x + η,  with η ~ N(0, Σ),  (7)

where Σ is the covariance matrix of the multivariate Gaussian noise. Furthermore, we assumed each spatial feature to be processed independently and thus to be corrupted by independent noise [1]. Hence, the covariance matrix is defined as:

Σ = diag(σ²_itd, σ²_ild, σ²_mon I),  (8)

with σ²_itd and σ²_ild being the variances associated with the ITDs and ILDs, and σ²_mon I being the covariance matrix for the monaural features, where I is the identity matrix and the scalar σ_mon represents a constant and identical uncertainty for all frequency bands.
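Because the covariance is diagonal, corrupting the feature vector amounts to adding independent Gaussian noise with a per-feature standard deviation. A sketch under our own naming, using the noise magnitudes quoted later in the parameter-estimation section as placeholder values:

```python
import numpy as np

def noisy_features(x, sigma_itd, sigma_ild, sigma_mon, rng):
    """Corrupt x = [x_itd, x_ild, monaural features...] with independent
    Gaussian noise, i.e. a diagonal covariance matrix."""
    n_mon = len(x) - 2
    sigmas = np.concatenate([[sigma_itd, sigma_ild], np.full(n_mon, sigma_mon)])
    return x + rng.standard_normal(len(x)) * sigmas

rng = np.random.default_rng(1)
x = np.zeros(2 + 54)                       # ITD, ILD, 27 bands x 2 ears
t = noisy_features(x, 0.569, 1.0, 3.5, rng)
```

Drawing many such samples reproduces the per-feature standard deviations, which is all the model assumes about the internal noise.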

Bayesian inference
The observer infers the sound direction u from the spatial features in t while taking into account potential prior beliefs about the sound direction. Within the Bayesian inference framework [5], this requires weighting the likelihood p(t|u) with the prior p(u) to obtain the posterior distribution by means of Bayes' law:

p(u|t) = p(t|u) p(u) / p(t).  (9)

The likelihood function was implemented by comparing t with the feature templates. The template T(u) contains the noiseless features of equation (1) for every sound direction u [6]. While the sound direction was defined on a continuous support, our implementation sampled it over a quasi-uniform spherical grid with a spacing of 4.5° between points (N_u = 1500 over the full sphere). Template features were computed from the listener-specific HRTFs.
To accommodate non-uniform HRTF acquisition grids, we performed spatial interpolation based on spherical harmonics of order N_SH = 15 with Tikhonov regularization [41].
Since the templates were constructed without noise, there exists a one-to-one mapping between a direction and its template features. This allowed us to write the likelihood function for each point of the direction grid as:

p(t|u) ∝ exp( −(1/2) (t − T(u))^T Σ⁻¹ (t − T(u)) ),  (10)

where Σ represents the learned precision of the auditory system (i.e., the sensory uncertainty defined in Eq. (7)). Finally, we interpreted the a-priori probability p(u) to reflect long-term expectations of listeners, with prior probabilities modelled as uniformly distributed along the horizontal dimension but concentrated towards the horizon [8]. In particular, we extended the results of Ege et al. to sources positioned in the front as well as in the back:

p(u) ∝ exp( −ε(u)² / (2σ²_P) ),  (11)

with ε(u) denoting the elevation angle of u and σ²_P the variance of the prior distribution [8]. For simplicity, the prior definition was based on the spherical coordinate system. Importantly, the origin of this prior is currently unknown; its implications are discussed in Section 4.
According to equation (9), a posterior spatial probability distribution was computed for every sound by optimally combining sensory evidence with prior knowledge [28]. As shown in Figure 4, the most probable direction of the source was then selected as the maximum a-posteriori (MAP) estimate:

û = arg max_u p(u|t).  (12)
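On a discrete direction grid, the whole inference step collapses to an argmax over log-likelihood plus log-prior. The following toy sketch (our own naming, a 5-direction grid, and made-up feature values) also illustrates the accuracy-precision trade-off discussed in the Introduction: with the eye-level prior active, a noisy observation near a high-elevation direction gets pulled towards the horizon.

```python
import numpy as np

def map_direction(t, templates, elevations, sigmas, sigma_prior_deg):
    """MAP direction estimate: Gaussian likelihood against noiseless templates
    (diagonal covariance), weighted by an elevation-centered Gaussian prior."""
    d = templates - t                                  # (N_dir, N_feat) residuals
    loglik = -0.5 * np.sum((d / sigmas) ** 2, axis=1)
    # prior: uniform in azimuth, concentrated at 0 deg elevation
    logprior = -0.5 * (elevations / sigma_prior_deg) ** 2
    return np.argmax(loglik + logprior)                # index into the grid

# toy grid: 5 directions with 3 features each, at increasing elevations
templates = np.array([[0., 0, 0], [1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]])
elevations = np.array([0., 20, 40, 60, 80])
sigmas = np.ones(3)
t = np.array([2.1, 1.9, 2.0])                          # observation near direction 2
idx = map_direction(t, templates, elevations, sigmas, sigma_prior_deg=11.5)
```

With the prior (σ_P = 11.5°, the starting value used later for fitting), the estimate lands on the lower-elevation direction 1; disabling the prior (a very large σ_P) recovers the maximum-likelihood choice, direction 2.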

Response stage
After the inference of the sound direction, experiments usually require the listener to provide a motor response (e.g., manual pointing). To account for the uncertainty introduced by such responses, we incorporated post-decision noise in the model's response stage. Following the approach from previous work [15], we blurred the location estimate with additive, direction-independent (i.e., isotropic) noise:

û_resp = û + m,  with m ~ vMF(0, κ_m),  (13)

where vMF(0, κ_m) denotes a von-Mises-Fisher distribution with zero mean and concentration parameter κ_m, which can be expressed as an angular standard deviation σ_m. The contribution of the response noise is visible in Figure 4, where the final estimate is scattered independently of the spatial information provided by the a-posteriori distribution. With equation (13), the model outputs the estimated sound-source direction as its response.
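Scattering a direction estimate with von-Mises-Fisher noise can be implemented with the closed-form inverse-CDF sampler available for the sphere. This is a sketch under our own naming; the small-angle correspondence κ ≈ 1/σ² (σ in radians) used below is an approximation, not the paper's exact definition.

```python
import numpy as np

def sample_vmf(mu, kappa, rng):
    """Draw one unit vector from a von-Mises-Fisher distribution on the sphere,
    centered at the unit vector mu with concentration kappa."""
    u = rng.uniform()
    # inverse-CDF sample of w = cos(angle to mu); closed form on the 2-sphere
    w = 1.0 + np.log(u + (1.0 - u) * np.exp(-2.0 * kappa)) / kappa
    theta = rng.uniform(0.0, 2.0 * np.pi)
    s = np.sqrt(max(1.0 - w * w, 0.0))
    sample = np.array([s * np.cos(theta), s * np.sin(theta), w])  # around +z
    # rotate +z onto mu (Rodrigues' rotation formula)
    ez = np.array([0.0, 0.0, 1.0])
    if np.allclose(mu, ez):
        return sample
    if np.allclose(mu, -ez):
        return sample * np.array([1.0, 1.0, -1.0])
    axis = np.cross(ez, mu)
    axis /= np.linalg.norm(axis)
    ang = np.arccos(np.clip(np.dot(ez, mu), -1.0, 1.0))
    K = np.array([[0, -axis[2], axis[1]],
                  [axis[2], 0, -axis[0]],
                  [-axis[1], axis[0], 0]])
    R = np.eye(3) + np.sin(ang) * K + (1.0 - np.cos(ang)) * (K @ K)
    return R @ sample

# scatter a MAP estimate pointing to the right (+y) with sigma_m = 17 deg
rng = np.random.default_rng(4)
kappa = 1.0 / np.deg2rad(17.0) ** 2       # rough small-angle correspondence
mu = np.array([0.0, 1.0, 0.0])
pts = np.array([sample_vmf(mu, kappa, rng) for _ in range(500)])
```

The samples stay on the unit sphere and cluster around mu with the intended angular spread.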

Parameter estimation
The model includes the following free parameters: σ_ild and σ_mon (amount of noise per feature; σ_itd was fixed to 0.569 as in [6]), σ_P (directional prior), and σ_m (amount of response noise). Because of the model's structure, these parameters jointly contribute to the prediction of performance in both the lateral and polar dimensions. To account for listener-specific differences in localization performance [2], the parameters were fitted to match individual listener performance.
As the objective fitting function, we selected a set of performance metrics widely used in the analysis of behavioral localization experiments [22,23,42]; for a summary, see [43]. A commonly used set of metrics contains the quadrant error rate (QE, i.e., the frequency of polar errors larger than 90°), the local polar error (PE, i.e., the root-mean-square error in the polar dimension for errors smaller than 90°, limited to lateral angles in the range of ±30°), and the lateral error (LE, i.e., the root-mean-square error in the lateral dimension) [22]. We accounted for the inherent stochasticity of the model estimations by averaging the simulated performance metrics over 300 repetitions of the N_u = 1550 directions in the HRTF dataset (i.e., a Monte-Carlo approximation with 465,000 model simulations). Model parameters were jointly adjusted in an iterative procedure (see below) until the relative residual between the actual performance metric E_a and the predicted performance metric E_p was minimized below a metric-specific threshold τ_E, i.e.,

|E_p − E_a| / E_a < τ_E.  (14)

We set the thresholds to τ_LE = 0.1, τ_PE = 0.15, and τ_QE = 0.2 because these values were feasible for all subjects. In addition, the QE was transformed with the rationalized arcsine function to handle small and large values adequately [44]. We ran the estimation procedure separately for each feature space in equation (6) and each listener. First, initial values of the parameters were derived from previous literature: the variance of the prior distribution was set to σ_P = 11.5° as in [8]. The interaural feature noise was set to σ_ild = 1 dB, reflecting the range of ILD thresholds for pure tones [45]. The starting value for the monaural feature noise was set to σ_mon = 3.5 dB as in [6]. The response noise standard deviation was set to σ_m = 17°, the sensorimotor scatter found in [15].
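The three performance metrics can be computed directly from paired actual/response angles. The sketch below follows our reading of the conventions above (polar errors wrapped to [0°, 180°], PE and QE restricted to lateral angles within ±30°); the function name and toy data are ours.

```python
import numpy as np

def localization_metrics(lat_a, pol_a, lat_r, pol_r):
    """LE, PE, QE from actual (a) and response (r) lateral/polar angles (deg)."""
    le = np.sqrt(np.mean((lat_r - lat_a) ** 2))        # lateral RMS error
    median = np.abs(lat_a) <= 30                       # near-median-plane targets
    perr = np.abs(pol_r - pol_a) % 360
    perr = np.minimum(perr, 360 - perr)                # wrap to [0, 180]
    qe = np.mean(perr[median] > 90) * 100              # quadrant error rate (%)
    local = median & (perr <= 90)
    pe = np.sqrt(np.mean(perr[local] ** 2))            # local polar RMS error
    return le, pe, qe

# toy data: four targets, one response is a front-back confusion
lat_a = np.array([0., 10, -20, 40])
pol_a = np.array([0., 30, 90, 180])
lat_r = lat_a + np.array([5., -5, 5, -5])
pol_r = np.array([10., 20, 250, 175])
le, pe, qe = localization_metrics(lat_a, pol_a, lat_r, pol_r)
```

For this toy set, LE is 5°, PE is 10°, and QE is one confusion out of three median-plane targets.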
Second, in an iterative procedure, σ_m was optimized to minimize the residual error relative to the PE metric and, similarly, σ_mon was adjusted to match the QE metric. Then, σ_ild was decreased to reach the LE metric. These steps were reiterated until the residual errors between the actual and simulated metrics were less than the respective thresholds. This procedure limited σ_m to the interval [5°, 20°] with a step size of 0.1°; σ_mon was defined in the interval [0.5, 10] dB with a step size of 0.05 dB; σ_ild was defined in the interval [0.5, 2] dB with a step size of 0.5 dB. If the procedure did not converge, we decreased σ_P by 0.5° and reattempted the parameter optimization.

Results
We first report the quality of model fits to the calibration data itself [23] in Section 3.1. Then, Section 3.2 quantitatively evaluates the simulated performances of our two model variants and of two previously proposed models against data from two additional sound localization experiments.

Parameter fits
We ran the parameter estimation procedure for both model variants, based on either t_MP or t_GP, and for the five individuals tested in a previous study [23]. In that experiment, naive listeners were asked to localize broadband noise bursts of 500 ms duration presented from various directions on the sphere via binaural rendering through headphones based on listener-specific directional transfer functions. The subjects wore a head-mounted display and were asked to orient the pointer in their right hand towards the perceived sound-source direction. The fitting procedure converged for both models and all subjects. Notably, subject NH15 required a reduced step size for σ_m of 0.1° to meet the convergence criteria. Table 1 reports the estimated parameters σ_m, σ_P, σ_mon, and σ_ild for every listener. The amount of response noise was similar for both model types. Table 2 contrasts the predicted performance metrics with the actual ones, averaged across listeners.
In more detail, Figure 5 compares the predicted localization performance to the actual performance of subjects estimating the direction of a noise burst for different spherical segments [23]. The predicted LEs and PEs, as functions of the actual lateral and polar angles, respectively, were in good agreement with those from the actual experiment. However, the simulated QE metric failed to mimic the front-back asymmetries present in four subjects. Finally, only small differences were observed between the two feature spaces t_MP and t_GP. Figure 6 illustrates the effects of the different model stages on target-specific predictions. The example shows direction estimations from subject NH16 localizing broadband noise bursts [23] and the corresponding predictions of the model based on t_GP with different configurations of priors and response noise: without both (a), with priors only (b), and with both (c). While adding response noise scattered the estimated directions equally across spatial dimensions (compare c to b), including the spatial prior only affected the polar dimension (compare b to a). As observed in the actual responses, the prior biased more of the simulated estimations towards the horizon (0° and 180°).

Contribution of model stages
In order to quantify the effect of introducing the spatial prior in the polar dimension, we computed the polar gain as a measure of accuracy [13] for both the simulated and the actual responses. This metric relies on two regressions performed on the baseline condition, separating targets in the front and back. The linear fits for the baseline condition were defined as:

φ_e = g_φ φ_a + b_φ,  (15)

with φ_e being the estimated polar angles and φ_a the actual polar angles. The parameters were the localization bias b_φ in degrees, which is typically very small, and the dimensionless localization gain g_φ, which can be seen as a measure of accuracy [8,13]. The regression fits only incorporated estimates φ_e deviating from the regression line by less than 40°. Since this definition of outliers depended on the regression parameters, the procedure was initialized with b_φ = 0° and g_φ = 1 and re-iterated until convergence. In our analysis, only the frontal positions were considered. The polar gain of the actual responses, averaged over subjects, was 0.50, indicating that our subjects showed a localization error increasing with the angular distance from the horizontal plane. For the models without the prior, the predicted polar gain was 1.00 (Fig. 6a). The polar gain obtained by the model including the prior was 0.62 (Figs. 6b and 6c), showing a better correspondence to the actual polar gain. Hence, the introduction of the prior belief improved the agreement with the actual localization responses by biasing them towards the horizon.
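The iterative outlier-excluding regression described above can be sketched as a short fixed-point loop. This is our own minimal version with synthetic responses generated at a known gain of 0.5 and a few front-back confusions as outliers; the fixed iteration count stands in for a convergence check.

```python
import numpy as np

def polar_gain(phi_a, phi_e, max_dev=40.0, n_iter=20):
    """Iteratively re-fit phi_e = g*phi_a + b, excluding responses that deviate
    from the current regression line by more than max_dev degrees."""
    g, b = 1.0, 0.0                        # initialization as in the text
    for _ in range(n_iter):
        keep = np.abs(phi_e - (g * phi_a + b)) < max_dev
        g, b = np.polyfit(phi_a[keep], phi_e[keep], 1)
    return g, b

rng = np.random.default_rng(3)
phi_a = rng.uniform(-60, 60, 200)                 # actual polar angles (front)
phi_e = 0.5 * phi_a + rng.normal(0, 5, 200)       # responses with gain 0.5
phi_e[:10] += 180                                  # simulated front-back confusions
g, b = polar_gain(phi_a, phi_e)
```

The confused responses sit far from any plausible regression line, so the exclusion rule removes them and the recovered gain stays close to the generating value of 0.5.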

Model evaluation
The performance evaluation was done at the group level. For our model, we used the five calibrated parameter sets with templates T(u) based on the individuals' HRTFs as "digital observers". Group-level results of these digital observers were then evaluated for two psychoacoustic experiments whose acoustic input differed from the baseline condition with a flat source spectrum and individual HRTFs. In addition, we compared our results with those of two previously published models. The first one, described by Reijniers et al. [6], is probabilistic and able to jointly estimate the lateral and polar dimensions, similar to the model described in this work. Reijniers' model deviates from ours in that it relies on a different feature extraction stage, uses a uniform spatial prior distribution, does not include response noise (Eq. (13)), and does not fit individualized parameters. The second model, described by Baumgartner et al. [15], estimates sound positions only in the polar dimension. Nevertheless, it shares a similar processing pipeline with our model, since both consider a perceptually relevant feature extraction stage, response noise, and individualized parameters. The main differences from our model are its lack of a directional prior, its lack of the lateral dimension, and a different method to compute the sensory evidence. Notably, this previous work implemented the template comparison with the ℓ1-norm, which is substantially different from our likelihood function in equation (10). At present, the Baumgartner et al. model is commonly used by the scientific community for predictions of elevation perception based on monaural spectral features (e.g., [46,47]). We refer to these two models as reijniers2014 and baumgartner2014, respectively.

Effects of non-individual HRTFs
In the first evaluation, sounds were spatialized using non-individualized HRTFs [22]. Originally, eleven listeners localized Gaussian white noise bursts with a duration of 250 ms, with sound directions randomly sampled from the full sphere. Subjects were asked to estimate the direction of sounds that were spatialized using their own HRTFs in addition to sounds that were spatialized using up to four HRTFs from other subjects (21 cases in total). With the aim to reproduce these results, we had our pool of five digital listeners localize sounds from all available directions that were spatialized with their own individual HRTFs (Own) as well as sounds that were spatialized with the HRTFs of the other four individuals (Other). We thus considered all inter-listener HRTF combinations for the non-individual condition. Figure 7 summarizes the results obtained for localization experiments with own and other HRTFs. In the Own condition, there was a small deviation between the actual results from [22] and our model predictions. This mismatch reflects the fact that the digital observers represented a different pool of subjects (taken from [23]) tested on a slightly different experimental protocol and setup. Differences in performance metrics were small between the two feature spaces, as already reported during parameter fitting. Predictions from the baumgartner2014 model are only possible for the polar dimension. In contrast, the model reijniers2014 predicted too small errors, as also observed in previous simulations employing this model [27,48].

Figure 5. Sound-localization performance as a function of the direction of a broadband sound source. Open symbols: predictions obtained by the two model variants based on either spectral magnitude profiles (MPs) or gradient profiles (GPs). Filled grey symbols: actual data from [23]. Top row: lateral error, calculated for all targets with lateral angles of −65° ± 25°, −20° ± 20°, 20° ± 20°, and 65° ± 25°. Center and bottom rows: polar error and quadrant error rates, respectively, calculated for all median-plane (±30°) targets with polar angles of 0° ± 30°, 90° ± 60°, and 180° ± 30°.
In the Other condition, both of our model variants predicted a smaller degradation in the lateral dimension than observed in the actual data. The lateral errors predicted by reijniers2014 increased moderately but remained too small in comparison to the actual data. In the polar dimension, both model variants resulted in increased PEs and QEs, but the amount of increase was larger and more similar to the actual data for the variant equipped with gradient profiles, especially with respect to QE. As expected, the predictions from baumgartner2014 were similar to those of the model based on spectral gradients, given the similar method of extracting the monaural spectral features. The simulation results for reijniers2014 agreed with the super-human performance described in [27].

Effects of rippled-spectrum sources
The second evaluation tested the effect of spectral modulation of the sound source on directional localization in the polar dimension [33]. In that study, localization performance was probed using noises in the frequency band from 1 to 16 kHz whose spectral shapes were distorted by a sinusoidal modulation in the log-magnitude domain. The conditions considered different ripple depths, defined as the peak-to-peak difference of the log-spectral magnitude, and ripple densities, defined as the number of sinusoidal periods per octave along the logarithmic frequency scale. The actual experiment tested six trained subjects in a dark, anechoic chamber listening to the stimuli via loudspeakers. The sounds lasted 250 ms and were positioned between lateral angles of ±30° and polar angles of either 0° ± 60° for the front or 180° ± 60° for the back. A "baseline" condition included a broadband noise without any spectral modulation (ripple depth of 0 dB). To quantify the localization performance, we used the polar error rate (PER) as defined in [33]. For every condition, two baseline regressions were computed as in Section 3.1, allowing us to quantify the PER as the ratio of actual responses deviating by more than 45° from the values predicted by the baseline regression. Figure 8 shows the results of testing the fitted models with rippled spectra. In the baseline condition, our model exhibited performances similar to those obtained in the actual experiment, whereas baumgartner2014 underestimated the baseline performance for this particular error metric. In the ripple conditions, actual listeners showed the poorest performance for densities around one ripple per octave and a systematic increase in error rate with increasing ripple depth. The model variant based on gradient profiles predicted these effects well, similar to the predictions from baumgartner2014.
In contrast, both reijniers2014 and the variant based on magnitude profiles were not able to reflect the effects of ripple density and depth as present in the actual data. Hence, the positive gradient extraction appears to be a crucial processing step for predicting sagittal-plane localization of sources with a non-flat spectrum.
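A rippled log-magnitude spectrum as used in this evaluation is straightforward to generate. The sketch below is ours: the sine convention, the ripple phase, and the 1 kHz reference for the octave axis are our assumptions about the stimulus definition, not taken from [33].

```python
import numpy as np

def rippled_magnitude(freqs, depth_db, density, phase=0.0, f_lo=1000.0):
    """Log-magnitude ripple: a sinusoid over log-frequency with peak-to-peak
    depth 'depth_db' (dB) and 'density' ripples per octave, re f_lo."""
    octaves = np.log2(freqs / f_lo)
    return 0.5 * depth_db * np.sin(2 * np.pi * density * octaves + phase)

# the 1-16 kHz band used in the actual experiment, on a log-frequency grid
freqs = np.geomspace(1000, 16000, 512)
ripple = rippled_magnitude(freqs, depth_db=40.0, density=1.0)
```

With a depth of 40 dB and one ripple per octave (the fixed values of the two sweeps shown in Figure 8), the magnitude oscillates over ±20 dB across the four octaves of the band.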

Discussion
The proposed functional model aims at reproducing listeners' performance when inferring the direction of a broadband sound source. The model formulation relies on Bayesian inference [28], as it integrates the sensory evidence for spatial directions, obtained by combining binaural and monaural features [13], with a spatial prior [8]. Our approach considers uncertainties about the various sensory features, as in [6], in addition to the noise introduced by pointing responses [15]. These components enabled us to successfully match the model predictions with the actual subjects' performance by means of overall performance metrics (LE, PE, and QE) for five subjects (see Tab. 2) and within spatially restricted areas (Fig. 5). With the inclusion of a spatial prior, the model was able to adequately explain listeners' response biases towards the horizontal plane.

Figure 7. Localization performance with individual (Own) and non-individual (Other) HRTFs. Actual data from [22] (data_middlebrooks1999). Model predictions for the two model variants: spectral magnitude profiles (MPs) and spectral gradient profiles (GPs). As references, predictions by the models reijniers2014 [6] and baumgartner2014 [15] are shown. Note that baumgartner2014 does not predict the lateral error.

Figure 8. Effect of spectral ripples in the source spectrum on sound-localization performance in the median plane. Right-most bottom panel: localization error rates obtained without spectral ripples, serving as reference. Top and bottom left panels: differences to the reference condition shown in the right-most bottom panel. In addition to the predictions from the two model variants (MP and GP), predictions from reijniers2014 [6] and baumgartner2014 [15] as well as actual data from [33] (data_macpherson2003) are shown. Note that ripple densities were varied at a fixed ripple depth of 40 dB, and ripple depths were varied at a fixed ripple density of one ripple per octave.
Compared to previous models [6,15], our model better predicted the group-level effects of non-individualized HRTFs and rippled source spectra, yet only when selecting positive spectral gradient profiles as monaural spectral features.
We evaluated the model in scenarios where the subject was spatially static and listened to a broadband, spatially static sound source in an acoustic free field. In this scenario, we simplified the representation of interaural features to broadband ITDs and ILDs [18,31]. Extensions to the feature set are required when applying our model to more complex sounds, such as music or speech (e.g., [49]). Also, modeling sound localization in a multi-talker environment requires a frequency-dependent extraction of the interaural cues, considering temporal fine-structure within a restricted frequency range or temporal envelope disparities [32]. Similarly, modeling the localization of sound sources placed in reverberant spaces necessitates more complex feature-specific cochlear processing to account for phenomena like the precedence effect [50,51]. Still, our model structure is open to integrating such extensions in the future. In its present form, the model is ready to assess the degree of HRTF personalization by comparing the sound-localization performance predicted with a given HRTF set against that obtained with a set of listener-specific HRTFs [52].
In our model, the MAP decision rule selects the most probable source position from the posterior distribution. We preferred this estimator over the mean estimator to adequately deal with the multiple modes of the posterior distribution generated by front-back confusions along sagittal planes. On the other hand, the MAP estimator disproportionately biases direction estimates towards the prior's mode, at least under conditions of high sensory uncertainty. One of many possible alternative estimators that may better describe stochastic human localization responses is posterior sampling [8], in which the model draws the perceptual estimate from the full posterior distribution. Although often considered suboptimal, this estimator would allow the observer to adapt to novel environmental statistics [53]. However, a different estimator might affect the fitted sensory and motor noise parameters. Therefore, comparative evaluations of different estimators would require a more robust fitting procedure, which is outside the scope of the current study.
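The contrast between these decision rules can be illustrated with a minimal numerical sketch. The direction grid, the bimodal posterior, and all numeric values below are hypothetical illustrations, not part of the fitted model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical discretized posterior over polar angle (degrees),
# bimodal due to a front-back confusion along a sagittal plane.
angles = np.arange(-90, 271, 10)                              # -90 (below) .. 270 (rear)
posterior = np.exp(-0.5 * ((angles - 30) / 15) ** 2)          # frontal mode at 30 deg
posterior += 0.6 * np.exp(-0.5 * ((angles - 150) / 15) ** 2)  # weaker rear mode at 150 deg
posterior /= posterior.sum()

# MAP estimator: pick the single most probable direction (here the frontal mode).
map_estimate = angles[np.argmax(posterior)]

# Posterior-sampling estimator: draw the response from the full posterior,
# so rear (front-back confused) responses occur with the rear mode's probability.
sampled_estimate = rng.choice(angles, p=posterior)

# Mean estimator (for contrast): averages across the two modes and can land
# between them, which is why it handles front-back confusions poorly.
mean_estimate = np.sum(angles * posterior)

print(map_estimate, mean_estimate)
```

With such a bimodal posterior, the MAP rule always reports the frontal mode, posterior sampling occasionally produces rear responses, and the mean estimator can land in the perceptually implausible region between the two modes.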
The model incorporates several non-acoustic components because they are crucial in explaining human performance [2,54]. Extending the reijniers2014 model [6] with a spatial prior and response scatter appeared vital to explain listeners' response patterns; without these components, fitting the model to the polar performance metrics was infeasible [27]. First, response noise allowed us to control the response precision locally (LE and PE) while leaving the global errors (QE) unaffected; the global errors depend predominantly on the variance of the noise added to the monaural features. Second, the spatial prior shapes the response patterns by introducing a bias towards the horizon [42]. As shown in Figure 6, the prior contributed to the polar component of the simulated responses, which clustered around the eye-level direction. The polar gain measure provided additional evidence, as reported in Section 3.1, where integrating the prior beliefs led to a better match of the performance in the vertical dimension. We extrapolated the spatial prior's formulation from [8] by assuming a prior distribution symmetric between front and back positions. Discrepancies observed between actual and predicted global errors (Fig. 5) indicate that this assumption was likely incorrect and point towards a front-back asymmetric prior instead. Nonetheless, we can only speculate about the reasons behind such a long-term prior in spatial hearing: it might reflect the spatial distribution of sound sources during everyday exposure [55], it may stem from an evolutionary emphasis on highly relevant auditory signals [4], or it could be related to the center of gaze as observed in barn owls [56], although the processing underlying the spatial inference mechanism might differ in mammals [3].
While the model only considers spatially static listening scenarios, it sets a foundation for future work on predicting sound-localization behavior in realistic environments. For example, modeling the environment's dynamics as a chain of consecutive stationary events is promising (e.g., [7]). Sequentially updating the listener's beliefs by taking the posterior as the next prior appears to be a natural mechanism under the Bayesian inference scheme [28], and our model is a well-suited basis for such investigations. A rich set of modulators might influence the mechanism of spatial hearing, and the model's prior belief is the entry point to account for many of them, such as evidence accumulation to track source statistics [20,57], visual influences on auditory spatial perception [58], or auditory attention to segregate sources [59]. Selective temporal integration appears critical when dealing with the spatial information of many natural sources in realistic scenarios; integrating recent findings on interaural feature extraction might address this aspect [60]. To this end, the model must account for the dynamic interaction between the listener and the acoustic field. These extensions will potentially enable the model to account for subject movements [9] and simultaneous tracking of source movements [61] while extracting spatial information from echoic scenarios [62].

Conclusions
We proposed a computational auditory model for the perceptual estimation of the direction of a broadband sound source based on Bayesian inference. From a binaural input, the model estimates the sound direction by combining spatial prior beliefs with sensory evidence composed of auditory features. The model parameters are interpretable and related to sensory noise, prior uncertainty, and response noise. Having fitted the parameters to match subject-specific performance in a baseline condition, we accurately predicted the localization performance observed for test conditions with non-individualized HRTFs and spectrally-modulated source spectra. Regarding spectral monaural feature extraction, the model variant based on the spectral gradient profiles performed best.
The proposed model is useful for assessing the perceptual validity of HRTFs. While its domain is currently limited to spatially static conditions, it provides a good basis for future extensions towards spatially dynamic situations, spectrally dynamic signals like speech and music, and reverberant environments.