Effects of interpersonal familiarity on the auditory distance perception of level-equalized reverberant speech

Familiarity with sound sources is known to have a modulatory effect on auditory distance perception. However, the level of familiarity that can affect distance perception is not clearly understood. A subjective experiment that aims to investigate the effects of interpersonal familiarity on auditory distance perception with level-equalized stimuli is reported. The experiment involves a binaural listening task where different source distances between 0.5 and 16 m were emulated by convolving dry speech signals with measured binaural room impulse responses. The experimental paradigm involved level-equalized stimuli comprising speech signals recorded from different-gender couples who self-reported having known each other for more than a year with daily interaction. Each subject judged the distances of a total of 15 different speech stimuli from their partner as well as from the spectrally most similar and most dissimilar strangers, for six different emulated distances. The main finding is that a similar but unfamiliar speaker is localized further away than a familiar speaker. Another finding is that the semantic properties of speech can potentially have a modulating effect on auditory distance judgements.


Introduction
The capability to localize sound sources in three dimensions provides distinct evolutionary advantages and is, alongside stereoscopic vision, one of the two major ways in which humans perceive space. Localization of sound sources in azimuth and elevation has been extensively studied and the psychophysical mechanisms that govern subjective localization in these dimensions have largely been identified [1]. However, auditory distance perception, despite having exceptional theoretical relevance and practical importance, has not been studied in as much detail. Even less attention has been given to its cognitive aspects. This article presents the results of a study investigating the effects of familiarity on distance perception.
Unlike directional localization which mainly depends on interaural and spectral cues, distance perception occurs as a result of the fusion and interplay of many different relative cues that have physical, perceptual and cognitive origins. While the effects of certain perceptual cues such as intensity, direct-to-reverberant energy (D/R) ratio, sound source spectrum, auditory parallax and vocal effort on auditory distance perception are relatively easy to quantify, they do not constitute the only relevant cues [2]. These cues can be summarized as follows.
The intensity of a point-like source decreases by 6 dB per doubling of distance in the acoustic free field, in accordance with the "inverse-square law", and acts as a dominant distance cue [3]. However, real sound sources are rarely point-like except at low frequencies. Since distance attenuation may deviate from the inverse-square law for sources that are not point-like, intensity may not always convey reliable distance information, either as an absolute or as a relative cue, when it is the only distance cue [4].
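As a worked illustration of the 6 dB figure, assuming an ideal point source in a free field and an arbitrary reference distance d_0 (a symbol introduced here only for illustration), the level L at distance d can be written as:

```latex
L(d) = L(d_0) - 20\log_{10}\!\left(\frac{d}{d_0}\right)\ \text{dB},
\qquad
L(2d) - L(d) = -20\log_{10} 2 \approx -6.02\ \text{dB}.
```

Under the same idealized assumption, the range of 0.5 to 16 m considered in this study spans five doublings of distance, which would correspond to a level difference of approximately 30 dB in the direct sound.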
The ratio of the energies of the early and late parts of sound in a room aids the localization of a sound source. This ratio, also known as the D/R ratio, is known to be another dominant auditory distance cue [5]. Experiments show that the human ability to separately perceive the direct sound in a reverberant field depends on both the time delay between the direct sound and the reverberant components and the D/R ratio [4,6,7]. Although listeners are sensitive to changes in D/R ratio while estimating distance in a room, it is not known for certain whether D/R ratio or some other parameter that covaries with the latter is used as a cue [8].
Auditory distance perception is also affected by spectral cues [9]. The most likely explanation for this effect is atmospheric absorption which attenuates higher frequencies more than the lower frequencies [10]. Another acoustic binaural cue related to distance perception is auditory parallax [1] where interaural ear attributes change with the source distance.
An acoustic cue which is specifically related to conversational contexts is vocal effort. Distance judgements given to the same speech message whispered, spoken normally and shouted at the same nominal sound pressure level are different [11]. In comparison with normal speech, the distance of whispered speech is underestimated and the distance of shouted speech is overestimated. More specifically, in a communication context, the speaker adjusts their speech to overcome the degrading effect of distance attenuation and reverberation and to maintain the intelligibility of speech by the addressed party [12,13]. Such articulatory adjustments include changes not only in the sound pressure level, but also in the spectrotemporal properties of speech [14]. Familiarity with levels of vocal effort is known to aid in distance perception [15,16]. A more recent study showed that in a conversational setting where the speaker kept eye contact with the listener while speaking, visually perceived source distance and acoustic properties of the enclosure had significant effects on power, phonation time ratio, and the mean and standard deviation of fundamental frequency of the produced speech [17]. Such studies show the strong effect of vocal effort and more generally the contextual cues in distance perception in a conversational context. Presence of a visual anchor is also known to support auditory distance perception in interaction with the other acoustic cues [18,19].
Despite the strong impact of the perceptual cues summarized above, they are not the only cues that humans use to estimate the location of a sound source, and nonperceptual factors can also influence estimates of the perceived distance of a sound source [4]. Previous studies have shown that affective state can be a modulating factor in auditory distance perception. For example, participants in a fearful state judged the distance to the target to be closer than participants in a neutral state did, judging targets to be reachable at distances that were further than for the neutral group [20]. Familiarity with the sound source can also have a strong effect on auditory distance perception. Wisniewski et al. compared the distance judgements of listeners to stimuli spoken in their native language (semantically and acoustically familiar) with stimuli spoken in a foreign language (semantically unfamiliar but acoustically familiar) and time-reversed speech (neither semantically nor acoustically familiar). The study showed that familiarity with the speech features could improve the accuracy of distance judgements [21]. Results of another, earlier study indicate that increasing familiarity with a sound source and its surrounding environment results in more accurate distance judgements [22].
Previous studies on auditory distance perception also show that humans tend to overestimate distances of sound sources in the peripersonal space and they tend to underestimate the distances of sound sources in the extrapersonal space [6,23]. Auditory distance perception in the front direction is mostly accurate for sound sources positioned approximately 1 m away from the listener.
Two representations of distance perception were previously proposed. The first representation relates the direct-to-reverberant energy ratio to subjective distance in rooms [5], combining strictly physical properties such as the reverberation radius of the room and the directivity factor of the sound source. A second, more parsimonious representation [7] is a compressed power function relating actual to perceived distance in a way similar to Stevens' power law [24]. Both representations incorporate asymptotic behavior corresponding to an auditory horizon beyond which no source is localized. Similarly, both representations capture the experimental findings that auditory distances of sources closer than about 1 m are overestimated and of those further than about 1 m are underestimated.
This paper is concerned with the effect of interpersonal familiarity on auditory distance perception. The hypothesis under test is that the distance of a familiar speaker is localized more accurately than the distances of unfamiliar speakers uttering the same sentences. We investigate this hypothesis in a distance perception experiment where the subjects rate the distances of familiar and unfamiliar speakers in a binaural listening task with level-equalized stimuli. We presuppose that individuals who have had daily interactions for an extended period of time are familiar with each other's voices. Therefore, the cohort of participants employed in this study comprised couples who had known each other for at least one year and had daily interactions. We also made sure that each participant personally knew only their partner and not the other participants. The participants were asked to judge the distances of binaural stimuli emulating different source distances using speech signals consisting of five balanced sentences spoken by the participant's partner as well as by the participants spectrally most similar and most dissimilar to the latter. Our initial assumption was that an unfamiliar speaker's voice with a high long-term spectral similarity to a familiar speaker's voice would be localized less accurately than the familiar speaker and more accurately than the unfamiliar speaker with a low spectral similarity.
Similar to some earlier auditory distance perception studies [25-29] that aimed to investigate reverberation cues for auditory distance perception and how participants perceive distance based on changes in D/R ratio, this study also employs stimuli equalized for overall broadband level. This was done in order to eliminate the effect of level as the most dominant distance cue, leaving D/R ratio as the main distance cue. The main intention herein was to make it easier to isolate, a posteriori, the effect of interpersonal familiarity as a modulating factor in distance perception of reverberant speech.
The results indicated that the familiarity and the semantic content of the stimuli as well as their interaction have statistically significant effects on the distance judgements. However, this effect did not occur in the way that we initially predicted. Our findings are: (1) spectrally most similar speaker to a familiar speaker is localized further away than a familiar speaker, and (2) the distances of a familiar speaker and the unfamiliar speaker with the lowest long-term spectral similarity are localized similarly. These findings indicate that while interpersonal familiarity is a modulating factor of distance perception, its effect is non-linear.

Participants
A total of 24 unpaid participants comprising 12 different-gender couples, ranging in age from 19 to 30, participated in the study. All couples were partners for at least one year with daily interaction. The participants acted in two ways: (1) their speech was recorded to be used as stimuli in the experiment, and (2) they provided distance judgements in the experiment. All of the participants were native Turkish speakers. None of the participants reported any hearing-related or vocal problems. Also, none of the participants had any prior experience of listening experiments.
One of the couples from the originally employed group decided not to take part in the listening experiment and was removed from the set of participants. However, the speech signals obtained from those subjects were retained for the listening experiment. Therefore, a total of 22 participants (11 male and 11 female with an age range of 21-30 years) took part in the listening experiment.

Stimuli
Stimuli consisted of binaural emulations of different sound source distances obtained by convolving dry speech signals from the participants with binaural room impulse responses measured in a reverberant skyway.

Speech signals
Speech signals used to produce the binaural stimuli were recorded in the METU Spatial Audio Research Group (SPARG) Lab which has a floor area of 11 m² and a low reverberation time of T30 = 80 ms. The background noise within the lab is below 40 dB SPL. The structure is specially designed with oblique walls to avoid strong standing waves. The walls are covered with 10 cm thick mineral wool screened with acoustically transparent textile to reduce reflections. Therefore, the recordings were sufficiently dry and were treated as being practically anechoic as far as the present experimental conditions are concerned.
Five different sentences in Turkish language were constructed using words that had been scaled as being emotionally neutral [30]. These sentences and their English translations are: S1: Dünya gezegendir. Note that all of these sentences have an equal number of six syllables and are thus balanced. Note also that since Turkish is a gender-neutral language, these sentences do not contain any gender-specific information.
Recordings were made using a cardioid vocal microphone (BM-800) positioned 1 m away from the speaker. The height of the microphone and its acoustic axis were adjusted to match the speaker's mouth before each recording. The speakers were instructed to maintain their regular speaking level, avoiding any additional vocal effort and emotional prosody during the recordings. The sampling rate was 44.1 kHz and the amplitude resolution was 32 bits. There were two reasons for recording speech at a distance from the speaker and not using a miniature microphone positioned close to the speaker's mouth: (1) in order to reduce the proximity effect that accentuates low frequencies due to recording in the near-field of the sound source [31], and (2) in order to provide a visual anchor according to which the speakers can adjust their voices.
In order to assess the effects of interpersonal familiarity, speech signals to be used in the experiment were selected differently for each participant. We defined three familiarity categories for this purpose: (1) Familiar (V0) represents the sentence as spoken by the participant's partner, (2) Similar-Unfamiliar (V1) corresponds to the same sentence as spoken by an unfamiliar speaker with the highest long-term spectral similarity, and (3) Dissimilar-Unfamiliar (V2) corresponds to the same sentence as spoken by an unfamiliar speaker with the lowest long-term spectral similarity.
The long-term spectral similarity was calculated using the Euclidean distances between normalized vectors representing the band energies of speech signals processed using a 64 channel Gammatone filterbank with center frequencies matching the center frequencies of equivalent rectangular bandwidth (ERB) filters [32].
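The selection of the similar and dissimilar strangers can be sketched as follows. This is a minimal illustration and not the implementation used in the study: it assumes that the 64 band energies per speaker have already been computed by a Gammatone filterbank, that "normalized" means unit Euclidean norm, and that the center frequencies follow the Glasberg and Moore ERB-rate formula between hypothetical limits of 50 Hz and 8 kHz. The function names are introduced here for illustration only.

```python
import math

def erb_center_frequencies(n_channels=64, f_low=50.0, f_high=8000.0):
    """Center frequencies equally spaced on the ERB-rate scale.

    Assumption: Glasberg & Moore ERB-rate mapping; f_low and f_high
    are illustrative limits, not values taken from the study.
    """
    def hz_to_erb_rate(f):
        return 21.4 * math.log10(4.37 * f / 1000.0 + 1.0)

    def erb_rate_to_hz(e):
        return (10.0 ** (e / 21.4) - 1.0) * 1000.0 / 4.37

    lo, hi = hz_to_erb_rate(f_low), hz_to_erb_rate(f_high)
    step = (hi - lo) / (n_channels - 1)
    return [erb_rate_to_hz(lo + i * step) for i in range(n_channels)]

def normalize(vec):
    """Scale a band-energy vector to unit Euclidean norm (assumed normalization)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def spectral_distance(a, b):
    """Euclidean distance between two normalized band-energy vectors."""
    return math.dist(normalize(a), normalize(b))

def pick_strangers(partner, candidates):
    """Return (most similar, most dissimilar) speaker ids.

    partner: band energies of the familiar speaker.
    candidates: dict mapping speaker id -> band energies of a stranger.
    """
    ranked = sorted(candidates,
                    key=lambda s: spectral_distance(partner, candidates[s]))
    return ranked[0], ranked[-1]
```

A smaller distance corresponds to a higher long-term spectral similarity, so the first and last entries of the ranking would play the roles of V1 and V2, respectively.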
Since the participants would have been familiar with the presented stimuli after the experiment, a repetition of the experiment with the same set of participants would not have been possible.

Binaural stimuli
An observational study of auditory distance perception is in general not feasible due to the difficulties involved in controlling the experimental conditions. A practical alternative involves presenting the subjects with binaural recordings that emulate natural spatial hearing scenarios with the stimuli presented over a pair of headphones.
In order to achieve flexibility in the creation of different stimuli corresponding to different source distances, binaural room impulse responses (BRIRs) were measured in the skyway connecting the two wings of the METU Graduate School of Informatics. The dimensions of the skyway are 18 × 1.8 × 2.5 m. The sides of the skyway are glass, the floor is marble and the ceiling is made of curved corrugated metal. The reverberation time was measured as T30 = 1.15 s.
A binaural microphone (Neumann KU-100) was placed at a height of 1.5 m and on the midline 1 m away from one end of the skyway. BRIRs in the front direction of the binaural microphone were measured using a small, two-way active loudspeaker (Genelec 6010A) positioned 0.5, 1, 2, 4, 8, and 16 m away from the binaural microphone at 1.5 m height as shown in Figure 1. The decision to select these particular distances is motivated by an earlier study of auditory distance perception using a similar logarithmic spacing [23]. The BRIRs were measured using the logarithmic sine sweep method [33].
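For readers unfamiliar with the logarithmic sine sweep method, the excitation signal can be sketched as below. The start and end frequencies, the duration and the sampling rate are illustrative assumptions; the values used in the actual measurements are not stated here. The impulse response is then typically recovered by deconvolving the recorded response with the sweep, a step that is omitted from this sketch.

```python
import math

def log_sine_sweep(f_start=20.0, f_end=20000.0, duration=5.0, fs=44100):
    """Exponential (logarithmic) sine sweep with unit amplitude.

    The instantaneous frequency rises exponentially from f_start to
    f_end over `duration` seconds. All parameter values are assumptions
    chosen for illustration.
    """
    n = int(duration * fs)
    rate = math.log(f_end / f_start)
    k = 2.0 * math.pi * f_start * duration / rate
    return [math.sin(k * (math.exp(rate * i / (fs * duration)) - 1.0))
            for i in range(n)]
```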
A total of 90 stimuli per participant (5 sentences, 6 distances, 3 familiarity categories) were prepared by convolving the previously recorded dry speech signals with the measured BRIRs. This way, each participant received a different set of stimuli.
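The stimulus generation step can be sketched as follows, assuming that each BRIR is available as a pair of left and right ear impulse responses and that the dry speech is a mono signal. The function and variable names are hypothetical. A direct time-domain convolution is used only to keep the sketch dependency-free; an FFT-based convolution would be the practical choice for signals of realistic length.

```python
def convolve(x, h):
    """Direct linear convolution of two sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

def binauralize(dry, brir_left, brir_right):
    """Convolve a dry mono signal with a BRIR pair; returns (left, right)."""
    return convolve(dry, brir_left), convolve(dry, brir_right)

# Enumerate the experimental conditions presented to one participant.
SENTENCES = ["S1", "S2", "S3", "S4", "S5"]
DISTANCES_M = [0.5, 1, 2, 4, 8, 16]
VOICES = ["V0", "V1", "V2"]  # familiar, similar-unfamiliar, dissimilar-unfamiliar

conditions = [(s, d, v) for s in SENTENCES for d in DISTANCES_M for v in VOICES]
```

The length of `conditions` is 5 × 6 × 3 = 90, matching the number of stimuli per participant.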

Level equalization
The recording levels in the nearly-anechoic conditions under which the dry speech signals were captured were similar but not identical across participants. Similarly, the energy of the late reverberation was assumed to be largely the same for all recording distances. The presentation levels of the stimuli were therefore equalized to eliminate the possible confounding effect of level. Considering earlier studies showing that D/R ratio is a distance cue that can be as effectively used as level especially at high reverberation levels [27], our assumption was that reliable responses can be elicited with level-equalized stimuli.
Level equalization resulted in stimuli with similar total energy whilst substantially preserving the change in the level of the direct sound according to the inverse-square law, consequently preserving the relative direct path level cues and the relative D/R ratios with respect to the source distance. The motivation behind this choice was to prevent the participants from adapting to the levels of the presented stimuli. In other words, level equalization served to eliminate the dominant effect of level cues, which may have confounded the responses since level is both a physical (i.e. due to sound propagation) and a cognitive (i.e. due to familiarity with the speaker's nominal speech level) cue. Therefore, the participants were forced, in effect, to use D/R ratio as the only available dominant distance cue, making the differences observed in the elicited responses attributable either to D/R ratio differences or to the differences in interpersonal familiarity. The physical effects of level equalization on the relative levels of direct sound and reverberation are discussed in the Appendix.
Level equalization was carried out by normalizing binaural stimuli for their total broadband energy with respect to the highest total energy across the employed left and right ear signals. The presentation level was then selected in the following way. A 1 kHz sinusoidal signal was played back from the headphones employed in the listening test (Superlux HD-330) and the reproduction level was measured using a calibrated miniature microphone positioned at the entrance of the right ear canal of the binaural microphone using a sound level meter (Faber Acoustical Sound Meter). The level of the headphone amplifier (Focusrite Scarlett 18i8) was adjusted to and fixed at 80 dB SPL. The processed speech signals were then played back via the same pair of headphones and the sound level was measured. The average peak and equivalent sound pressure levels, Lp and Leq, were 77.3 dB and 66.1 dB for female speech samples and 77.5 dB and 65.9 dB for male speech samples, respectively. These values are generally in agreement with the average level of running speech [34]. While the presentation level varied slightly for each stimulus, the variance of Leq was less than about 2 dB for most cases.
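One possible reading of the normalization step is sketched below: every binaural stimulus is scaled so that its total energy, summed over the left and right ear signals, equals the highest total energy found in the set. This interpretation and the function names are assumptions made for illustration. A single gain is applied to both ears so that interaural level differences are left unchanged.

```python
import math

def total_energy(left, right):
    """Total broadband energy summed over both ear signals."""
    return sum(v * v for v in left) + sum(v * v for v in right)

def equalize_levels(stimuli):
    """Scale each (left, right) pair to the highest total energy in the set.

    stimuli: list of (left, right) tuples of sample lists.
    Assumption: one common gain per stimulus, applied to both ears.
    """
    target = max(total_energy(l, r) for l, r in stimuli)
    equalized = []
    for l, r in stimuli:
        gain = math.sqrt(target / total_energy(l, r))
        equalized.append(([gain * v for v in l], [gain * v for v in r]))
    return equalized
```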

Experimental task
The 22 participants whose voices were recorded earlier also took part in the listening experiment. Each participant attended the experiment alone. Level-equalized stimuli were presented over headphones in the METU SPARG Lab. The participant's task was to listen to the presented stimulus and to adjust a vertical slider to indicate the perceived distance. The participants were allowed to listen to each stimulus only once. The end points of the slider represented 0 and 20 m distance. The upper limit of the response scale was selected based on our assumption that source distances would be underestimated on average and especially so for sources at larger distances. OpenSesame software [35] was used to set up, control and run the experiment. In order to alleviate the effects of visual feedback on distance perception, the room was darkened except for the light coming from the computer screen. The stimuli presented to the participants were randomized for distance, sentence and familiarity and each stimulus was presented only once. The participants were not informed about how the stimuli were obtained. Overall, each participant provided 90 responses. The experiment took approximately 10 min on average per participant to complete.
A training session was run for each participant prior to the actual experiment in order to acquaint them with the user interface as well as the experimental task. Binaural stimuli produced with speech signals different from the ones used in the experiment were used during the training session. In order to demonstrate the range of stimuli that the participants will hear, three examples from the nearest (i.e. 0.5 m) and three examples from the furthest (i.e. 16 m) distance were presented. No feedback was provided about the actual distances of the stimuli. The stimuli were assumed to be well-externalized since none of the participants reported inside-the-head localization.
After the experiment, one of the participants stated that he did not fully understand the experimental task. The data collected from that participant were discarded.

Results
Following methods used in previous auditory distance perception studies [1,2,6,7,36], a compressive power function proposed by Zahorik [7] was fit to the collected data. The representation defines the relation between the actual distance, $D$, and the perceived distance, $\hat{D}$, as a compressive power function in the following form:
$$\hat{D} = b\,D^{a}$$
where b is a constant and a is the power law exponent. The compressive model parameter values present a high-level summary of the collected responses. The parameters, a and b, of the compressive model represent how compressive the auditory distance perception is, and how much the distance of a source positioned at 1 m is over- or underestimated, respectively. The ideal case where the simulated and the perceived distances are equal corresponds to a = b = 1. A higher value of a < 1 corresponds to a lower compression resulting in a higher slope in the log perceived distance vs. log actual distance line. A higher value of b > 1 corresponds to a higher overestimation of the distance of a source positioned at 1 m and vice versa. After eliminating the outliers (defined as a data point outside 1.5 times the interquartile range after logarithmic transform of the data), the geometric means of the collected responses were calculated and least-squares fits of power functions as described above were obtained. Figure 2 shows the aggregate responses for each participant as well as the power function fits to geometric means of responses for each simulated distance.
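The fitting procedure can be sketched as follows for a power function of the form perceived = b · actual^a. Several details are assumptions made for this sketch rather than statements about the analysis actually performed: the least-squares fit is carried out on log-transformed distances, outliers are identified with Tukey fences using the default quartile method of Python's `statistics.quantiles`, and all responses are assumed to be strictly positive so that the logarithm is defined.

```python
import math
import statistics

def remove_outliers(responses):
    """Drop responses outside 1.5 x IQR fences computed on log-responses."""
    logs = [math.log(r) for r in responses]
    q1, _, q3 = statistics.quantiles(logs, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [r for r, lg in zip(responses, logs) if lo <= lg <= hi]

def geometric_mean(values):
    """Geometric mean of strictly positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def fit_power_function(actual, perceived):
    """Fit log(perceived) = log(b) + a * log(actual); returns (a, b)."""
    xs = [math.log(d) for d in actual]
    ys = [math.log(p) for p in perceived]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = math.exp(mean_y - a * mean_x)
    return a, b
```

In use, the responses for each simulated distance would first be passed through `remove_outliers` and `geometric_mean`, and the six resulting means would then be passed to `fit_power_function` together with the six simulated distances.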
It was observed that some of the participants had very low a values resulting in a flat curve, indicating either that (i) the participant responded completely randomly or (ii) they substantially and consistently provided similar responses regardless of the stimulus. Either situation is indicative of a problematic response pattern. Therefore, responses from participants with an a that is one standard deviation below the mean were discarded from the analyses. This corresponds to 4 participants (P16, P17, P18, P19) out of the entire set of 21 participants. The mean values of the fitted parameters of the remaining 17 participants were a = 0.31 ± 0.06 and b = 3.56 ± 1.12. The average coefficient of determination was R² = 0.82, indicating a good overall fit. As a second step, an analysis of variance (ANOVA) was used to examine the effects of the tested factors on distance perception. The factors are Familiarity, Sentence and Distance. The dependent variable is the log-distance error defined as:
$$E = \log D - \log \hat{D}$$
When E < 0, the source distance is overestimated in comparison with the actual distance (i.e. $\hat{D} > D$) and vice versa. Also, E ≈ 0 indicates veridical localization of distance (i.e. $\hat{D} \approx D$). In other words, a response with a small absolute log-distance error is considered more veridical. Notice that logarithmic transformation of the perceived and actual distances was used to normalize the variance of the error across different simulated distances. The log-distance errors obtained this way were confirmed via individual Kolmogorov-Smirnov tests [37] to be normally distributed, thereby allowing the use of an ANOVA model for further analysis.
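The exclusion rule and the dependent variable can be sketched as below. The sketch takes the log-distance error as the difference between the logarithms of the actual and perceived distances, so that it is negative when the distance is overestimated, as described above. The base-10 logarithm, the use of the sample standard deviation and the function names are assumptions made for illustration.

```python
import math
import statistics

def log_distance_error(actual, perceived):
    """E = log(actual) - log(perceived); negative when distance is overestimated.

    Assumption: base-10 logarithm.
    """
    return math.log10(actual) - math.log10(perceived)

def excluded_participants(exponents):
    """Ids of participants whose exponent a is below mean - 1 SD.

    exponents: dict mapping participant id -> fitted exponent a.
    Assumption: sample standard deviation over all participants.
    """
    values = list(exponents.values())
    threshold = statistics.mean(values) - statistics.stdev(values)
    return sorted(p for p, a in exponents.items() if a < threshold)
```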
The statistical model used for analyzing the elicited responses was a 3 × 5 × 6 ANOVA where the main factors were Familiarity, Sentence, and Distance. Since an initial analysis revealed that none of the interaction terms was significant except for Familiarity × Sentence, the non-significant interaction terms were removed in order to increase the statistical power of the employed model.
ANOVA indicates that differences between the mean log-distance errors are statistically significant for Distance Post-hoc comparisons were carried out by bootstrapping with 1000 repetitions and using the Bonferroni correction. Post-hoc comparisons for different sentences show that the differences of log-distance error between S5 and the other sentences are statistically significant (p < 0.001). In all cases average log-distance error difference is negative, indicating that S5 is localized at a closer distance on average than other sentences.
Post-hoc comparisons of different familiarity levels revealed that the differences between the log-distance error means of the familiarity levels, V0 and V1 (M = 0.111,   [38]. Figure 4 shows the log-distance errors for different familiarity levels and distances.
The differences between the mean log-distance errors are statistically significant across all pairs of distances with p < 0.001. In all cases, the effect sizes ranged from medium to large. The absolute log-distance error linearly increased with increasing actual log-distance, indicating a higher level of underestimation at larger distances.
Another question related to distance perception accuracy is about the effect of familiarity on the variability in the elicited responses. We used coefficient of variability (CV), defined as the ratio of the standard deviation to the mean, in order to assess whether the responses given to different levels of familiarity differed in their dispersions (Fig. 5). CV was calculated separately for each participant and for all sentences over responses grouped according to distance and familiarity level. ANOVA was used to investigate the possible differences between values of CV calculated for different experimental conditions. ANOVA

Discussion
The main finding of the presented study is that interpersonal familiarity is a factor that can modulate the auditory perception of distance of reverberant speech. A more detailed discussion of this and other findings follows.

Main findings
The compressive model parameter values obtained using the aggregate responses of each participant indicate that for nearby sound sources, especially within a distance less than about 3.5 m on average, participants tended to overestimate the source distance, while they underestimated distances of sources that are further away. This is slightly different from earlier results in the literature where subjects start to overestimate distance when the source distance is less than about 1 m. This difference is very likely due to the employed level equalization which results in approximately 3 dB difference between the actual and the normalized direct path components for source distances greater than 1 m (see Fig. A.1 in Appendix). In other words, level-equalized stimuli have a lower D/R ratio than natural stimuli for distances larger than the critical distance of the room. Therefore, source distances are overestimated more on average than in studies that employ non-equalized, naturalistic stimuli [36,39] since distance perception substantially relies on the D/R ratio as the predominant distance cue in the absence of absolute level cues.
The compressive model parameter values we obtained can be compared with the results of recent studies employing a similar binaural presentation approach. In one of these studies, Anderson and Zahorik [36] used windowed Gaussian noise convolved with BRIRs measured in a concert hall with a reverberation time of T60 = 1.9 s and reported an average a of 0.61 ± 0.30 and an average b of 2.22 ± 1.99. In another study using anechoic speech signals convolved with synthetic BRIRs, increasing the reverberation time was found to increase distance estimation errors and average parameter values of a = 0.53 and b = 1.89 were reported for a reverberation time of T60 = 0.86 s [39]. In comparison with both of these studies, the parameter values observed in the reported experiment indicate more compression (i.e. lower a values) and a higher distance offset (i.e. higher b values). These differences might be due to the employed stimuli, the elongated shape and the resulting acoustic features (e.g. lateral reflections) of the skyway, level equalization, or the lack of vocal effort as a covariate factor in the experimental design.
Familiar speakers (V0) were on average judged to be closer than unfamiliar speakers that have the most similar long-term spectra (V1). This difference was observed both as differences between the compressive model parameter values of these two conditions and via post-hoc comparisons applied on the log-distance errors.
Based on the compressive function fits to aggregated responses, the value of the parameter a was highest for V0 (a = 0.36), followed by V2 (a = 0.34) and V1 (a = 0.31), respectively. In a similar way b was lowest for V0 (b = 2.52) followed by V2 (b = 2.92) and V1 (b = 3.30), respectively. In other words, the familiar speaker is on average judged to be closer and more accurately localized in distance than an unfamiliar speaker, irrespective of its spectral similarity. These results indicate the possible role of long-term exposure to a familiar speaker's voice in increasing subjective distance localization accuracy. The higher level of the value of the parameter, b, for unfamiliar speakers can be linked to differences in proxemic behavior towards familiar and unfamiliar people, where a smaller interpersonal distance is typically preferred for the former and a larger interpersonal distance for the latter [40,41].
The statistical analysis of the log-distance errors revealed that the difference between the log-distance error means is statistically significant for V0 and V1. In contrast, no statistically significant difference exists between either familiar (V0) and unfamiliar speakers with dissimilar long-term spectra (V2) or the latter with similar (V1) long-term spectra. This is an unexpected result considering the assumption that V1 would constitute an intermediate step between V0 and V2 in some familiarity continuum. This mismatch between our expectation and the observed results might possibly be due to the spectrally most similar speaker being judged as more unfamiliar than the dissimilar-unfamiliar speaker. Such a difference can be attributed to the perceptual magnet effect which occurs as the perception of a stimulus is modulated by the categorical perception of that stimulus as familiar or unfamiliar. A similar-unfamiliar stimulus closer to a familiar/unfamiliar decision boundary can be perceived as being even more unfamiliar than a dissimilar-unfamiliar stimulus [42,43]. It is likely that familiarity with other features of speech signals such as vocal effort, rhythm, intonation and other prosodic features can also have analogous effects to spectral similarity in modulating distance judgements. A future study might be envisaged that uses speech signals generated via morphing between the familiar and dissimilar-unfamiliar speaker voices to provide a clearer understanding of the effect not only of spectral similarity but also of other prosodic features on auditory distance perception, possibly also taking into account the perceptual magnet effect.
Another finding concerns the effect of semantic content, as observed from the statistically significant differences between the mean log-distance errors of S5 and those of the other sentences. We speculate that the contextual cues in S5, which call for joint attention (i.e. to a spoon) in close vicinity (i.e. in the drawer) in the simple present tense, might have resulted in an underestimation of distance for that sentence on average. This finding points to a possible modulating effect of semantic cues on the auditory distance perception of reverberant speech.
Finally, the stimulus S5, when uttered by the familiar speaker, is perceived to be closer than combinations of other sentences with other speakers, familiar or unfamiliar. This finding is possible evidence of a combined effect of familiarity and semantic content in modulating distance perception.

Limitations
The employed experimental design allows tight control of the experimental variables. However, it has also likely resulted in combinations of level, D/R ratio and vocal effort that do not necessarily correspond to real-life listening scenarios. For example, familiarity with the expected vocal effort combined with visual cues has a strong effect on distance perception in a conversational context [36]. However, this study did not include vocal effort as a factor or covariate.
Level equalization resulted in the modification of absolute level cues, which could have been used by the participants as reference points. The lack of overall level as a cue is one of the possible reasons for the increased variability in the elicited responses in comparison with earlier studies (e.g. [2]). Similarly, the increased overestimation of source distance beyond the room's critical distance can be attributed to the non-linear change in the direct path level (see Fig. A.1 in the Appendix).
Another possible limitation concerns the finite upper limit of the employed scale. Some of the participants used the full scale and judged some of the stimuli to be further away than 16 m, which is where the furthest source emulated in the experiment was positioned. This upper limit might have biased the results, especially for larger stimulus distances. Similarly, a source discrimination or identification task instead of the employed absolute distance task could have also decreased the response variability by rendering the experiment easier for the participants.
Another possible limitation is that the participants themselves did not make similarity judgements for the presented stimuli. This was a deliberate choice to ensure that they were not familiarized with the employed stimuli prior to the experiment. It is possible that more appropriate ways than the Euclidean-distance-based approach can be found to quantify spectral similarity. For example, an A-weighted distance metric that accounts for the frequency-dependent sensitivity of the human auditory system could be more appropriate.
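One way such an A-weighted metric could be constructed is sketched below, purely as an illustration. The A-weighting curve is the standard analytic approximation. Everything else is a hypothetical choice not taken from the study: the function names, the band centers, the made-up spectra, and the decision to apply the weights to the squared differences in dB.

```python
import math


def a_weighting_db(freq_hz):
    """A-weighting gain in dB at freq_hz (standard analytic approximation)."""
    f2 = freq_hz ** 2
    ra = (12194.0 ** 2 * f2 ** 2) / (
        (f2 + 20.6 ** 2)
        * math.sqrt((f2 + 107.7 ** 2) * (f2 + 737.9 ** 2))
        * (f2 + 12194.0 ** 2)
    )
    return 20.0 * math.log10(ra) + 2.00


def euclidean_distance(spec_a_db, spec_b_db):
    """Plain Euclidean distance between two long-term spectra given in dB."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(spec_a_db, spec_b_db)))


def a_weighted_distance(spec_a_db, spec_b_db, centers_hz):
    """Hypothetical metric: squared dB differences weighted by A-weighting power gains.

    The weights are scaled to a mean of 1, so uniform weights would reduce
    this to the plain Euclidean distance.
    """
    weights = [10.0 ** (a_weighting_db(f) / 10.0) for f in centers_hz]
    scale = len(weights) / sum(weights)
    return math.sqrt(
        sum(w * scale * (x - y) ** 2 for w, x, y in zip(weights, spec_a_db, spec_b_db))
    )


# Made-up octave-band spectra (dB) for illustration only.
CENTERS_HZ = [125.0, 250.0, 500.0, 1000.0, 2000.0, 4000.0]
REFERENCE = [60.0, 62.0, 65.0, 63.0, 58.0, 50.0]
LOW_SHIFT = [66.0, 62.0, 65.0, 63.0, 58.0, 50.0]   # differs by 6 dB at 125 Hz
HIGH_SHIFT = [60.0, 62.0, 65.0, 63.0, 64.0, 50.0]  # differs by 6 dB at 2 kHz
```

The weights are applied to the squared differences rather than added to the spectra, because adding the same A-weighting curve to both dB spectra would cancel out in their difference. With the made-up data above, the two shifted spectra are equally far from the reference in the Euclidean sense, whereas the weighted metric treats the 2 kHz deviation as the larger one.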

Conclusions
We presented the results of a binaural listening experiment where the participants judged the distances of familiar and unfamiliar speakers using level-equalized stimuli. The experiment involved the headphone-based presentation of five balanced and nominally neutral sentences spoken by different-gender couples, convolved with BRIRs recorded in an elongated skyway. Each participant listened to speech signals from three different participants, including their long-term partner and two unfamiliar people of opposite gender with the lowest and highest long-term spectral similarity to the participant's partner, respectively. The participants then judged the apparent distances of the stimuli that they listened to, using a scale with 0 and 20 m as its endpoints.
The study showed that interpersonal familiarity modulates auditory distance perception. The distance of a binaurally presented reverberant speech signal from a familiar speaker is on average judged more accurately than that from an unfamiliar speaker. However, the difference is only statistically significant between the familiar and similar-unfamiliar conditions. The study revealed another finding that is counterintuitive at first sight: a spectrally similar, unfamiliar speech signal is judged to be even further away than a spectrally dissimilar, unfamiliar one. We speculate that this finding is related to the perceptual magnet effect.
To the best of the authors' knowledge, this is the first study in the literature investigating the effects of interpersonal familiarity on auditory distance perception. Long-term familiarity is very likely to modulate distance perception also under natural listening conditions, and possibly more prominently than in the experiment we reported, since additional cues such as absolute level, vocal effort and prosody would also be available. From an applied perception point of view, the findings can have practical implications in the domains of human-robot or human-agent interaction as well as the design of shared virtual and mixed reality environments.

Conflict of interest
The authors declared no conflict of interest.

Appendix
Let us now define $a_d(D)$ as the true level of the direct path for a source in acoustic free field at a distance $D$, and let us parameterize the ratio of direct and reverberant energies such that $q(D) = |a_d(D)|/\sqrt{r}$. Notice that $q(D)$ is proportional to $a_d(D)$ since the energy of the reverberant component, $r$, is independent of the source and receiver positions in the room. The level-normalized direct path becomes:

[equation missing in the source]

Attenuation due to distance follows the inverse-square law, and the following relation holds for the sound pressure levels due to a source positioned at $D_1$ and the same source positioned at $D_2$:

$$\Delta L_{D_1,D_2} = 20\log_{10}\frac{a_d(D_1)}{a_d(D_2)} = 20\log_{10}\frac{D_2}{D_1}\ \text{(dB)}. \tag{A.15}$$

Note also that since the reverberant energy is constant:

[equation missing in the source]

For the level-normalized signal, the following relation holds:

[equation missing in the source]

We will now show that, for the skyway in which we made our BRIR measurements, the change in the direct path levels of level-equalized stimuli follows the inverse-square law within a small and bounded error. The critical distance, $D_c$, in a room is the distance at which the energies of the direct and reverberant components are equal, such that $q(D_c) = a_d(D_c)/\sqrt{r} = 1$. The critical distance for a room with volume $V$ and a reverberation time of $T$, and with a sound source having a directivity gain of $c$, is given as [3]:

[equation missing in the source]

where $L_{D_c} = 20\log_{10}|a_d(D_c)|$. Notice that, in the limit $D \to \infty$, the level-normalized and actual direct path levels differ by 3.01 dB. In other words, the error induced by level equalization on the direct path component is always less than 3.01 dB.
Substituting $V \approx 76$ m$^3$, $T_{60} = 1.15$ s and assuming an average directivity gain of $c \approx 2$ for the Genelec 6010 loudspeaker we used, the critical distance for the skyway in which we made our measurements can be estimated as $D_c \approx 0.65$ m. In order to make this difference more comprehensible within the range of sound pressure levels employed in this article, let us now designate the direct path level at the critical distance, $a_d(D_c)$, as an arbitrary reference, such that $L_{D_c} = 70$ dB (SPL). Figure A.1 shows the actual and normalized direct path levels; the two differ by 1.28 dB at the closest and by $-3$ dB at the furthest emulated distance. These differences are small, and the distance attenuation trend closely follows the attenuation trend for the non-equalized case, especially beyond the critical distance. Therefore, the employed level equalization does not eliminate relative level cues pertaining to the level of the direct path component. In other words, the intensities of the direct path components of level-equalized stimuli also attenuate by approximately 6 dB per doubling of distance. Since the reverberant energy is assumed to be constant in a room regardless of the measurement position, the relative changes in the D/R ratio will also closely follow those of the non-equalized stimuli, retaining the relevant relative cues in the stimuli.
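The figures quoted above can be checked numerically with the following sketch. It rests on two assumptions that are not spelled out in the surviving text. First, the critical distance is taken from the common diffuse-field estimate $D_c \approx 0.057\sqrt{cV/T_{60}}$. Second, level equalization is modelled as scaling every stimulus to the total (direct plus reverberant) energy found at the critical distance, which gives a gain of $10\log_{10}\big(2/(1+q^2)\big)$ dB with $q = D_c/D$. The function names and the list of distances are illustrative.

```python
import math


def critical_distance(volume_m3, t60_s, directivity_gain):
    """Assumed diffuse-field estimate: D_c ~= 0.057 * sqrt(c * V / T60), in metres."""
    return 0.057 * math.sqrt(directivity_gain * volume_m3 / t60_s)


def direct_level_db(distance_m, d_c, level_at_dc_db=70.0):
    """Actual direct path level: inverse-square law referenced to L_Dc at D_c."""
    return level_at_dc_db + 20.0 * math.log10(d_c / distance_m)


def normalized_direct_level_db(distance_m, d_c, level_at_dc_db=70.0):
    """Direct path level after level equalization (assumed model).

    Assumption: each stimulus is scaled to the total energy present at D_c,
    i.e. a gain of 10*log10(2 / (1 + q^2)) dB with q = D_c / D.
    """
    q = d_c / distance_m
    gain_db = 10.0 * math.log10(2.0 / (1.0 + q * q))
    return direct_level_db(distance_m, d_c, level_at_dc_db) + gain_db


# Assumption: six emulated distances in doubling steps between 0.5 and 16 m.
DISTANCES_M = [0.5, 1.0, 2.0, 4.0, 8.0, 16.0]

if __name__ == "__main__":
    print(f"Estimated D_c: {critical_distance(76.0, 1.15, 2.0):.3f} m")
    for d in DISTANCES_M:
        actual = direct_level_db(d, 0.65)  # rounded D_c as quoted in the text
        normalized = normalized_direct_level_db(d, 0.65)
        print(f"{d:5.1f} m  actual {actual:6.2f} dB  "
              f"normalized {normalized:6.2f} dB  diff {actual - normalized:+.2f} dB")
```

Under these assumptions the sketch gives a critical distance of about 0.65 m, and the difference between actual and normalized direct path levels stays within roughly 1.3 dB at 0.5 m and 3 dB at 16 m, while the normalized level still drops by close to 6 dB per doubling of distance beyond the critical distance.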