Open Access
Issue
Acta Acust.
Volume 7, 2023
Article Number 34
Number of page(s) 6
Section Hearing, Audiology and Psychoacoustics
DOI https://doi.org/10.1051/aacus/2023028
Published online 07 July 2023

© The Author(s), Published by EDP Sciences, 2023

Licence: Creative Commons. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1 Introduction

In model-based binaural synthesis, head-related transfer functions (HRTFs) play a crucial role in creating virtual acoustic scenes. These multi-dimensional datasets usually cover a number of sound incidence directions, which allows for synthesizing virtual sound sources at almost arbitrary positions around the listener. For acquiring individualized HRTF sets, a variety of methods can be applied, such as acoustical measurements (e.g., [1–5]), sound field simulations (e.g., [6–10]), spectral reconstruction (e.g., [11, 12]), or adjustment of the magnitude [13] and/or phase [14] of a given generic HRTF set.

Table 1

Global 50th and 95th percentiles of the metric values of combined inter- and intra-individual HRTF comparisons. The 95th percentiles are used for normalization in Figures 2 and 3.

For validation, it is necessary to assess the subjective quality of the acquired HRTFs. However, as listening experiments are time-consuming and subject to uncertainties in the participants’ responses, it is often necessary to numerically quantify the differences between transfer functions resulting from the different approaches, as well as the variability within each approach. For example, approximation methods require a similarity criterion for a direct comparison of the resulting spectra with their respective target HRTFs. A survey of studies of this kind reveals the lack of a standardized way of capturing HRTF differences. The studies use a variety of distance metrics, which, depending on their properties, detect different aspects of spectral dissimilarity. For example, the varying response of different metrics to an angular mismatch between HRTF spectra was examined in [15]. Still, as no single metric is sufficient to predict and describe audible differences, a higher-level metric that combines the benefits of different metrics is desirable. The main focus of the present work lies in understanding the interrelation between a selected set of metrics. For this purpose, a case study is performed that evaluates an HRTF individualization method (principal component reconstruction) and thereby offers observations for different kinds of spectral deviations.

In this context, four questions are of interest: a) how the approximation errors introduced by the individualization method are reflected in distance metric behavior; and b) how these approximation errors scale relative to the error found between generic HRTFs. A comparison of metric behavior in different cases further gives insight into the redundancy between metrics; hence, Pearson correlation coefficients are examined to address the following questions: c) what are the common trends in metric behavior, and are they case-dependent; and d) can metric behavior be related to anthropometric deviations? Conclusions are drawn regarding the future definition of a generalized overarching metric.

2 Materials

The following sections introduce the distance metrics considered in the present work as well as the HRTF data used for analysis.

2.1 Distance metric overview

Seven distance metrics with varying properties were selected from or based on literature, covering different categories, as visualized in Figure 1. All of the metrics are monaural, each contrasting the transfer function of one ear and a single incidence direction with the corresponding transfer function in another HRTF set. Unless otherwise stated, the metrics consider the full frequency range up to 22.05 kHz.

Figure 1

Classification of the selected HRTF distance metrics. Based on metric properties, e.g. the frequency resolution, the main arithmetic operation performed, and the consideration of perceptual properties of the auditory system (or lack thereof), different behavior of the metrics can be presumed.

The classic frequency bin-based calculation of the Mean-Squared Error (MSE) metric

(1)

with nBins equal to 129, is extended by a perceptual weighting, introducing the Critical-Band Mean Squared Error (CB) [15]

(2)

with weighting factors αCB(i), where ΔfCB(i) is the critical bandwidth at the respective frequency and α0 is the value that makes the sum of αCB(i) equal to one.
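The effect of such a weighting can be illustrated with a minimal sketch. It assumes that the factors are inversely proportional to the critical bandwidth, i.e. αCB(i) = α0 / ΔfCB(i), and that the bandwidth follows the Zwicker-Terhardt approximation. Neither choice is confirmed by the text above, and all function names and the toy spectra are hypothetical.

```python
import math

def critical_bandwidth_hz(f_hz):
    """Zwicker-Terhardt approximation of the critical bandwidth in Hz (assumed)."""
    return 25.0 + 75.0 * (1.0 + 1.4 * (f_hz / 1000.0) ** 2) ** 0.69

def cb_weights(freqs_hz):
    """Assumed weights: inversely proportional to bandwidth, summing to one."""
    inverse_bw = [1.0 / critical_bandwidth_hz(f) for f in freqs_hz]
    alpha_0 = 1.0 / sum(inverse_bw)  # normalization constant
    return [alpha_0 * w for w in inverse_bw]

def mse(mag_a, mag_b):
    """Plain bin-wise mean squared error between two magnitude spectra."""
    return sum((a - b) ** 2 for a, b in zip(mag_a, mag_b)) / len(mag_a)

def mse_cb(mag_a, mag_b, freqs_hz):
    """Critical-band weighted squared error (hypothetical form)."""
    weights = cb_weights(freqs_hz)
    return sum(w * (a - b) ** 2 for w, a, b in zip(weights, mag_a, mag_b))

# Toy data: 129 bins up to 22.05 kHz, the same error placed at low or high bins
n_bins = 129
freqs = [i * 22050.0 / (n_bins - 1) for i in range(n_bins)]
flat = [1.0] * n_bins
low_error = [1.5 if i < 10 else 1.0 for i in range(n_bins)]
high_error = [1.5 if i >= n_bins - 10 else 1.0 for i in range(n_bins)]

print(mse(flat, low_error), mse(flat, high_error))  # same value for both
print(mse_cb(flat, low_error, freqs), mse_cb(flat, high_error, freqs))
```

Under these assumptions, the unweighted MSE cannot tell the two error locations apart, whereas the weighted variant rates the low-frequency error higher because the critical bands are narrower there.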

The Mel Frequency Cepstral Distortion [16, 17] performs a similar calculation on Mel Frequency Cepstral Coefficients (MFCCs) [18], defined on nCoeff (here: 24) bands as

(3)

Metrics based on mean squares capture overall gain differences between spectra; in contrast, a variance calculation emphasizes changes in spectral shape (SS), regardless of a potential gain offset. The latter category is represented here by the Inter-Subject Spectral Difference (ISSD) [13]

(4)

with the variance calculated on bin-level on Directional Transfer Functions (DTFs, HRTFs normalized by the diffuse field transfer function) for frequencies up to 13 kHz, and by the Loudness Level Spectral Error (LLSE, based on [19])

(5)

with loudness level difference ΔLL calculated in phon for nERB (here: 42) equivalent rectangular bandwidth (ERB) channels, respectively. A direct comparison of broadband spectral energy is provided by the Power Difference (PD)

(6)

thus capturing directional level differences, which would be present even if the user chose to normalize the complete HRTF sets, e.g., based on the level difference for frontal incidence. Finally, the Correlation Distance (CD) metric [20] considers phase in addition to magnitude information, as included in the head-related impulse response (HRIR), yielding

(7)

with the Gammatone (GT) bandpass filters (BP) corresponding to nGT (here: 42) bands.

Based on metric properties alone, certain observations can be expected. In contrast to envelope-based metrics, bin-based calculations should better detect changes in fine spectral structure. By considering the logarithmic frequency characteristic of the inner ear, either through weighting (MSECB) or through the choice of bandwidths in a filter bank (SSLLSE, MSEMFCD and CD), an overweighting of errors in the high frequency range is avoided, which better reflects human auditory perception.
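The distinction drawn in this section between mean-square and variance-based calculations can be made concrete with a small sketch. It assumes spectra given as dB values per frequency bin and uses simplified stand-ins for the two metric families. The function names and toy spectra are hypothetical, and the code does not reproduce the exact definitions of Eqs. (1) and (4).

```python
import statistics

def mean_square(spec_a_db, spec_b_db):
    """Mean of squared bin-wise differences: sensitive to a gain offset."""
    diffs = [a - b for a, b in zip(spec_a_db, spec_b_db)]
    return sum(d * d for d in diffs) / len(diffs)

def difference_variance(spec_a_db, spec_b_db):
    """Variance of bin-wise differences: ignores a constant offset."""
    diffs = [a - b for a, b in zip(spec_a_db, spec_b_db)]
    return statistics.pvariance(diffs)

reference = [0.0] * 20
reference[10] = -30.0                   # one pronounced notch
offset = [v + 6.0 for v in reference]   # same shape, 6 dB louder
smoothed = list(reference)
smoothed[10] = -15.0                    # notch partly filled in

print(mean_square(offset, reference), difference_variance(offset, reference))
print(mean_square(smoothed, reference), difference_variance(smoothed, reference))
```

In this toy case a pure gain offset yields a non-zero mean-square value but zero variance, while a change in notch depth, i.e. in spectral shape, registers in both.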

2.2 HRTF reconstruction

As a basis for the present analysis, we used 20 measured HRTF sets from the ITA HRTF database [21]. Only left-ear data were evaluated, and these were furthermore limited to elevations between ±45° to avoid over-representation of incidence directions close to the poles, given the 5° × 5° equiangular sampling grid. This resulted in a total of 1296 incidence directions. The transfer functions were additionally approximated using Principal Component Analysis (PCA) reconstruction, as described in [22]. This individualization method comprises two steps. First, the magnitude spectra are expressed as a weighted sum of principal components (PCs; here: 23), the ideal weights (scores) being a direct output of the PCA. Data reconstructed in this way are referred to below as the “idealPCA” dataset. The second step is an approximation of these ideal weights by a weighted sum of anthropometric features. Six head and ear dimensions with the lowest bivariate correlation [23] are chosen for this regression model, leading to the following features: {“h”, “d2”, “d3”, “d6”, “d7”, “d8”}, as defined in the HRTF database. The resulting dataset is referred to below as “anthroPCA”. Since PCA is applied here to magnitude data, the reconstructed spectra initially possess no phase information. A minimum phase [24] is therefore added, yielding the correct shape of the HRIR, with the impulse energy properly distributed over time.
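The two reconstruction steps can be sketched as follows. This is a minimal illustration, not the procedure of [22]: it assumes one magnitude spectrum per subject for a single direction, a plain least-squares regression with a bias term, and random toy data. Only 5 PCs are used, because 20 toy spectra span at most 19 components. All function and variable names are hypothetical.

```python
import numpy as np

def fit_pca(spectra, n_components):
    """spectra: (n_observations, n_bins). Returns mean, components, scores."""
    mean = spectra.mean(axis=0)
    centered = spectra - mean
    # Rows of vt are orthonormal principal components
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    scores = centered @ components.T  # "ideal" weights
    return mean, components, scores

def reconstruct(mean, components, scores):
    """Weighted sum of PCs added back onto the mean spectrum."""
    return mean + scores @ components

def fit_score_regression(features, scores):
    """Least-squares map from anthropometric features (plus bias) to PC scores."""
    design = np.column_stack([features, np.ones(len(features))])
    coeffs, *_ = np.linalg.lstsq(design, scores, rcond=None)
    return coeffs

def predict_scores(features, coeffs):
    design = np.column_stack([features, np.ones(len(features))])
    return design @ coeffs

# Toy data standing in for measured spectra and six anthropometric features
rng = np.random.default_rng(0)
n_subjects, n_bins, n_feat, n_pcs = 20, 129, 6, 5
features = rng.normal(size=(n_subjects, n_feat))
spectra = rng.normal(size=(n_subjects, n_bins))

mean, components, ideal_scores = fit_pca(spectra, n_pcs)
ideal_pca = reconstruct(mean, components, ideal_scores)      # step 1
coeffs = fit_score_regression(features, ideal_scores)
anthro_pca = reconstruct(mean, components,
                         predict_scores(features, coeffs))   # step 2

err_ideal = np.mean((spectra - ideal_pca) ** 2)
err_anthro = np.mean((spectra - anthro_pca) ** 2)
print(err_ideal, err_anthro)
```

Because the ideal scores are the orthogonal projection onto the retained components, the squared reconstruction error of the second step can never fall below that of the first within this sketch.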

3 Results and discussion

In the following, inter- and intra-individual errors are discussed for different datasets, followed by a global comparison of correlation behavior and a discussion of consequences for metric design.

3.1 Inter-individual error

Listening with non-individual spatial cues is relevant whenever only a generic HRTF set is available. This dataset may belong to a dummy head or another person, with a directivity more or less suitable for the listener. Since inter-individual differences between approximated HRTFs are not of practical relevance, the following evaluation is limited to the 20 “measured” HRTFs from the ITA database. Three representative HRTF sets are selected, each of which is compared with the remaining 19 sets. The selection is based on the relative deviation Δa of the individual’s anthropometric features from the mean of the 20 individuals. This deviation is defined for individual i as

(8)

with nFeat (here: 11) features (aj) from the ITA database {“w”, “dF”, “h”, “d1” – “d8”}, and μj being the mean of feature j over the 20 individuals. A closer look at the relation between metric values and Δa, with the individuals sorted by ascending Δai (see Figure 2), reveals that the value range does not scale with Δa for all metrics. Note that Figure 2 shows the distribution of metric values after normalization by the global 95th percentile of the respective metric (i.e., including the inter-individual subsets below and the intra-individual data from Sect. 3.2). The box plots represent data within the respective 25th and 75th percentiles. Apart from a rather weak tendency towards increasing median values (black dots), the box plots for each individual metric show a substantial overlap.
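The selection procedure can be sketched as follows. Since Eq. (8) is not reproduced here, the sketch assumes that Δa is the mean over all features of the absolute deviation from the feature mean, relative to that mean. This form, the function names, and the toy feature values are assumptions for illustration only.

```python
def anthropometric_deviation(features):
    """features: one list of (positive) feature values per individual.
    Returns one relative deviation per individual (assumed form of Eq. (8))."""
    n_ind = len(features)
    n_feat = len(features[0])
    means = [sum(row[j] for row in features) / n_ind for j in range(n_feat)]
    return [
        sum(abs(row[j] - means[j]) / means[j] for j in range(n_feat)) / n_feat
        for row in features
    ]

def rank_by_deviation(deviations):
    """Indices of individuals sorted by ascending deviation."""
    return sorted(range(len(deviations)), key=lambda i: deviations[i])

def select_min_median_max(deviations):
    """Individuals with minimum, median-rank and maximum deviation.
    For 20 individuals the middle entry is the 11th rank."""
    order = rank_by_deviation(deviations)
    return order[0], order[len(order) // 2], order[-1]

# Toy example: three individuals, two features
toy_features = [[10.0, 20.0], [12.0, 21.0], [17.0, 19.0]]
toy_deviations = anthropometric_deviation(toy_features)
print(toy_deviations)
print(select_min_median_max(toy_deviations))
```

Applied to the 20 individuals of the database, a selection of this kind would correspond to the choice of the minimum, 11th-rank and maximum deviation discussed in the text.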

Figure 2

Inter-individual HRTF errors for different reference individuals i, sorted by the deviation of their anthropometric features from the database mean, as expressed by Δai (Eq. (8)). The black dots indicate the respective median values.

Nevertheless, representative datasets with the minimum (ID 13), median, i.e., 11th rank (ID 8) and maximum (ID 1) anthropometric deviations are selected for further analysis in Figure 3a. At the top, each of the three selected HRTF sets is contrasted with the other 19 individuals. The black and gray lines correspond to pairs of contra-lateral horizontal plane magnitude spectra at −20° azimuth for ID 13 in direct comparison to IDs 8 and 1, shown in the middle row. Considering the example spectra, the much larger spectral deviations between IDs 13 vs. 8, compared to IDs 13 vs. 1 are striking. The disparity is directly reflected in the corresponding metric values (dark gray line). This rather unexpected behavior can be explained by the fact that the selection of these datasets is based on anthropometric similarity to the mean and does not guarantee higher similarity to individual datasets.

Figure 3

Metric behavior for (a) inter-individual errors within the “measured” dataset (with IDs 13, 8, 1 as reference, from left to right) and for intra-individual errors between (b) the “measured” and “anthroPCA”, (c) the “measured” and “idealPCA”, and (d) the “idealPCA” and “anthroPCA” datasets. The box plots (top) visualize the distribution of metric values, normalized by the 95th percentile of each metric. Two example pairs of transfer functions are depicted in the middle row, with an offset of 20 dB introduced for better visibility. Corresponding (normalized) metric values are indicated by the gray and black lines superimposing the box plots. Pearson correlation coefficients with p < 0.05 after Bonferroni correction (bottom) display varying patterns, with most prominent differences present in case (c).

Generally, measured HRTFs possess a high resolution of spectral cues, with sharp direction-dependent peaks and notches. Therefore, a mismatch between spectra with such properties seems to affect most metrics, capturing a variety of errors.

3.2 Intra-individual error

Similarly, intra-individual errors are visualized in Figures 3b–3d. Here, data points for 20 subjects × 1296 directions are available for each of the metrics. The example HRTF spectra are again extracted at −20°, yet from IDs 13 and 1 in the different datasets, respectively. Each of the two approximation steps comprised in PCA individualization introduces a different kind of error to the spectra, with both errors superimposing in the “anthroPCA” dataset, i.e., the final output of the individualization method.

Column (c) displays the error caused by the first approximation of measured spectra by PCs. As can be observed in the middle row, the compared spectra are almost identical, except for smoothing of prominent notches. Such deviations, inherently localized in frequency, seem to be well captured by variance-based metrics, particularly SSLLSE. Mean-square-based metrics, due to their averaging nature, are dominated by the spectral similarity in the remaining frequency regions and thus overlook the localized, though prominent and very likely audible [25], error. The CD metric also reacts strongly to the observed smoothing effect. It can be inferred that the localized spectral error and the ensuing changes in the minimum-phase response significantly modify the shape of the HRIRs, affecting the temporal cross-correlation.

Column (d) shows the error introduced by the regression model, which approximates the ideal PC scores based on anthropometric features of the target listener. The behavior is complementary to the previously discussed case (c); the CD metric consistently shows a much smaller error and SSLLSE almost none, while the remaining metrics exhibit a rise. As visible in the example spectra, this approximation step entails generally flatter magnitude spectra with possible disparities in peak and notch frequencies, compared to the spectra reconstructed with ideal scores. Rather than only dampening individual cue resonances, “non-individual” components are introduced to the spectra, prompting a reaction of the metrics that previously also detected inter-individual errors in column (a). Nonetheless, since the spectral cues are not very prominent, the metrics react less strongly.

Column (b) evaluates the overall performance of the individualization method in its typical use. The superposition of the two approximation errors (columns (c) and (d)) can be clearly observed in the resulting distributions. Most values lie below those of the inter-individual case (a), where measured HRTFs are contrasted. However, two overshooting metrics (SSLLSE and CD) raise the question of the perceptual relevance of the different metrics. Based solely on the five other distance metrics, individualized HRTFs would be preferred over a generic HRTF set. However, the perceptual effect of the different types of errors should still be taken into account for a correct judgment.

3.3 Correlation analysis

Pearson correlation coefficients (p < 0.05 after Bonferroni correction) were calculated to assess linear relationships between the metrics. Due to outliers towards high metric values, a logarithmic transform was applied beforehand to approximate a normal distribution. As shown in the bottom row of Figure 3, the correlation patterns vary somewhat between use cases. Common trends are a moderate to strong correlation of MSE with MSECB, and a moderate correlation between CD and variance-based metrics, especially SSLLSE. The latter is in line with the joint reaction of the two metrics in the box plots of columns (b) and (c).
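The analysis steps named in this paragraph can be sketched as follows. The sketch makes two assumptions that are not confirmed by the text: the p-value is obtained from the Fisher z-transform, which is a large-sample approximation rather than the usual t-test, and the Bonferroni correction divides the significance level by the number of metric pairs. Metric names and toy data are hypothetical.

```python
import math
import random
from itertools import combinations

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def approx_p_value(r, n):
    """Two-sided p-value via Fisher z-transform (large-sample approximation)."""
    if abs(r) >= 1.0:
        return 0.0
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2.0))

def significant_log_correlations(metrics, alpha=0.05):
    """metrics: dict name -> positive metric values, all of equal length.
    Returns correlations of log-transformed values that pass Bonferroni."""
    logged = {k: [math.log(v) for v in vals] for k, vals in metrics.items()}
    pairs = list(combinations(logged, 2))
    alpha_corrected = alpha / len(pairs)  # assumed: one test per metric pair
    result = {}
    for a, b in pairs:
        r = pearson_r(logged[a], logged[b])
        if approx_p_value(r, len(logged[a])) < alpha_corrected:
            result[(a, b)] = r
    return result

# Toy data: skewed values, two related metrics and one independent metric
rng = random.Random(0)
n = 500
base = [rng.lognormvariate(0.0, 1.0) for _ in range(n)]
toy_metrics = {
    "metric_a": base,
    "metric_b": [v * math.exp(0.1 * rng.gauss(0.0, 1.0)) for v in base],
    "metric_c": [rng.lognormvariate(0.0, 1.0) for _ in range(n)],
}
significant = significant_log_correlations(toy_metrics)
print(significant)
```

Only pairs that pass the corrected threshold are kept, mirroring the way non-significant coefficients are omitted from the correlation patterns in Figure 3.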

Largest deviations in correlation behavior are visible in column (c), where the “measured” and “idealPCA” datasets are contrasted. An otherwise moderate correlation between the power difference PD and metrics MSECB and MSEMFCD is not present in this case, since level differences are barely found between the spectra. Such a deviation between correlation patterns emphasizes the effect of different types of spectral properties on metric behavior and interrelation.

A weak to moderate negative correlation between CD and metrics MSE and MSECB, especially in column (b), could relate to the level dependency of the linear MSE calculation. Their response to ipsi-lateral errors is boosted, whereas CD tends to be more sensitive to contra-lateral errors, where the first (main) impulse of the HRIR, consisting of a summation of multiple paths around the head, is more temporally spread and of lower level. Later HRIR components, more affected by differences in pinna resonances, therefore tend to be more dominant in CD calculation for contra-lateral incidence.

Another noteworthy observation is the comparatively low correlation within each of the two groups (MSE and SS), with the exception of the regular MSE and MSECB. As the correlations referred to are mostly only moderate, they do not indicate that any of the metrics within the groups fails to provide additional information.

3.4 Implications for a generalized metric

A holistic approach to capture HRTF differences should not be restricted to specific cases but should allow for an assessment of arbitrary spectral deviations. Besides the four covered cases, this includes the comparison of measurements performed in different setups and the evaluation of errors ensuing from spatial interpolation or other HRTF processing methods. In the intended use of the metric, a-priori knowledge would therefore not be available; instead, the interplay of metrics should suggest the presence of one (or more) type(s) of spectral deviations. To enable such a description, the generalized metric should, at least to some degree, retain its multi-dimensional nature.

While much redundancy between metrics can be observed, its case-dependency advises against the elimination of a subset of metrics, as no metric can provide all of the information obtained by another. Instead, an approach should be chosen that exploits the specific variance of each of the metrics, e.g., based on PCA.

From the observations, it can be assumed that some metrics may react to specific spectral deviations. A classification of distance metrics into subgroups may describe this tendency; however, reducing each group to one “representative” metric may be counterproductive from a numerical perspective, as much of the variance would be lost. Nonetheless, a distinction should be made between the spectral variance explained by the metrics and the perceptual variance in audibility upon using the examined transfer functions for auralization. The findings do not rule out that fewer metrics may be sufficient for modeling auditory detectability. It should be emphasized, however, that it may be difficult to draw a direct connection between the spectral descriptors and auditory percepts. The subjectively perceived change does not depend only on the two spectra being contrasted. Instead, the similarity (or lack thereof) between a given spectrum and an individual’s “learned” spatial cues has a direct effect on how they attribute the spectral change to a perceptual property, such as localization or sound coloration. Therefore, only a spatially inclusive analysis, applying knowledge of the listener’s own HRTF set, would allow for generalization.

4 Conclusion

The present work examined the behavior of and correlation between seven distance metrics that are applicable to HRTF comparison. As a case study, metric responses to differences between generic HRTFs as well as to errors introduced by HRTF reconstruction were assessed. It could be observed that metric reactions and interrelation depend strongly on the underlying spectral deviations. While non-individual cues in generic HRTF comparison lead to the largest values, two metrics (SSLLSE and CD) showed the strongest reactions to a loss in peak and notch quality, a spectral error missed by most of the other metrics, especially those based on mean-square calculations. Although some trends in moderate to strong correlation were consistent in the four examined cases, the correlation patterns showed a rather strong case dependency. The results indicate that a generalized metric for describing HRTF differences with no a-priori knowledge of the underlying deviations should be modeled based on information from many (if not all) metrics, rather than reducing the latter to a representative subset. Future work could extend the study by additional metrics (including perceptual auditory models) and multi-variate approaches, such as Factor Analysis. A direct relation of the given metrics to audibility and to more specific anthropometric dissimilarities between individuals could furthermore be assessed.

Data availability statement

The presented distance metric data and the measured and reconstructed HRTFs in use can be downloaded from https://doi.org/10.18154/RWTH-2023-05668.

Conflict of interest

Part of this work has been previously presented at DAGA22, Stuttgart, Germany. The authors take complete responsibility for the integrity of the data and the accuracy of the analysis.

Funding

This work was funded by the German Research Foundation (DFG) – Project no. 402811912.

References

  1. H. Møller, M.F. Sørensen, D. Hammershøi, C.B. Jensen: Head-related transfer functions of human subjects. Journal of the Audio Engineering Society 43, 5 (1995) 300–321.
  2. V. Pulkki, M.-V. Laitinen, V. Sivonen: HRTF measurements with a continuously moving loudspeaker and swept sines, in 128th Audio Engineering Society Convention, London, UK, May 22–25, 2010.
  3. J.-G. Richter, G. Behler, J. Fels: Evaluation of a fast HRTF measurement system, in 140th Audio Engineering Society Convention, Paris, France, June 4–7, 2016.
  4. J. He, R. Ranjan, W.-S. Gan: Fast continuous HRTF acquisition with unconstrained movements of human subjects, in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2016, pp. 321–325.
  5. J. Reijniers, B. Partoens, J. Steckel, H. Peremans: HRTF measurement by means of unsupervised head movements with respect to a single fixed speaker. IEEE Access 8 (2020) 92287–92300.
  6. B.F. Katz: Boundary element method calculation of individual head-related transfer function. I. Rigid model calculation. Journal of the Acoustical Society of America 110, 5 (2001) 2440–2448.
  7. M. Otani, S. Ise: Fast calculation system specialized for head-related transfer function based on boundary element method. Journal of the Acoustical Society of America 119, 5 (2006) 2589–2598.
  8. H. Ziegelwanger, P. Majdak, W. Kreuzer: Numerical calculation of listener-specific head-related transfer functions and sound localization: Microphone model and mesh discretization. Journal of the Acoustical Society of America 138, 1 (2015) 208–222.
  9. T. Xiao, Q. Huo Liu: Finite difference computation of head-related transfer function for human hearing. Journal of the Acoustical Society of America 113, 5 (2003) 2434–2441.
  10. H. Takemoto, P. Mokhtari, H. Kato, R. Nishimura, K. Iida: Mechanism for generating peaks and notches of head-related transfer functions in the median plane. Journal of the Acoustical Society of America 132, 6 (2012) 3832–3841.
  11. D.J. Kistler, F.L. Wightman: A model of head-related transfer functions based on principal components analysis and minimum-phase reconstruction. Journal of the Acoustical Society of America 91, 3 (1992) 1637–1647.
  12. B.-S. Xie: Recovery of individual head-related transfer functions from a small set of measurements. Journal of the Acoustical Society of America 132, 1 (2012) 282–294.
  13. J.C. Middlebrooks: Individual differences in external-ear transfer functions reduced by scaling in frequency. Journal of the Acoustical Society of America 106, 3 (1999) 1480–1492.
  14. R. Algazi, C. Avendano, R.O. Duda: Estimation of a spherical-head model from anthropometry. Journal of the Audio Engineering Society 49, 6 (2001) 472–479.
  15. R. Nicol, V. Lemaire, A. Bondu, S. Busson: Looking for a relevant similarity criterion for HRTF clustering: A comparative study, in 120th Audio Engineering Society Convention, Paris, France, May 20–23, 2006.
  16. S. Shimada, N. Hayashi, S. Hayashi: A clustering method for sound localization transfer functions. Journal of the Audio Engineering Society 42, 7/8 (1994) 577–584.
  17. K.-S. Lee, S.-P. Lee: A relevant distance criterion for interpolation of head-related transfer functions. IEEE Transactions on Audio, Speech, and Language Processing 19 (2011) 1780–1790.
  18. S. Davis, P. Mermelstein: Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing 28, 4 (1980) 357–366.
  19. J. Huopaniemi, N. Zacharov, M. Karjalainen: Objective and subjective evaluation of head-related transfer function filter design. Journal of the Audio Engineering Society 47, 4 (1999) 218–239.
  20. B. Xie, C. Zhang, X. Zhong: A cluster and subjective selection-based HRTF customization scheme for improving binaural reproduction of 5.1 channel surround sound, in 134th Audio Engineering Society Convention, Rome, Italy, May 4–7, 2013.
  21. R. Bomhardt, M. de la Fuente Klein, J. Fels: A high-resolution head-related transfer function and three-dimensional ear model database. Proceedings of Meetings on Acoustics 29, 1 (2016) 050002.
  22. S. Hwang, Y. Park: Interpretations on principal components analysis of head-related impulse responses in the median plane. Journal of the Acoustical Society of America 123, 4 (2008) EL65–EL71.
  23. R. Bomhardt, J. Fels: Individualization of head-related transfer functions by the principle component analysis based on anthropometric measurements. Proceedings of Meetings on Acoustics 140, 4 (2016) 3277–3277.
  24. A. Kulkarni, S.K. Isabelle, H.S. Colburn: On the minimum-phase approximation of head-related transfer functions, in Proceedings of 1995 Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, NY, USA, October 15–18, 1995, pp. 84–87.
  25. M. Kohnen, R. Bomhardt, J. Fels, M. Vorländer: Just noticeable notch smoothing of head-related transfer functions, in Fortschritte der Akustik – DAGA 2018, Munich, Germany, March 19–22, 2018.

Cite this article as: Doma S., Brožová N. & Fels J. 2023. Examining the interrelation behavior of distance metrics for head-related transfer function evaluation: a case study. Acta Acustica, 7, 34.
