Modeling of speech-dependent own voice transfer characteristics for hearables with an in-ear microphone
M. Ohlenbusch, C. Rollwage & S. Doclo

Acta Acustica, Volume 8 (2024), Article Number 28, 13 pages
Section: Speech
DOI: https://doi.org/10.1051/aacus/2024032
Published online: 28 August 2024
Open Access

© The Author(s), Published by EDP Sciences, 2024

Licence: Creative Commons. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1 Introduction

Hearables, i.e., smart earpieces containing a loudspeaker and one or more microphones, are often used for speech communication in noisy acoustic environments. In this paper, we consider the scenario where the hearable is used to pick up the own voice of the user talking in a noisy environment (e.g., to be transmitted via a wireless link to a mobile phone or another hearable). Assuming that the hearable is at least partly occluding the ear canal, in this scenario an in-ear microphone may be beneficial to pick up the own voice since environmental noise is attenuated. Compared to own voice recorded at the outer face of the hearable, own voice recorded inside an occluded ear is known to suffer from amplification at low frequencies (below ca. 1 kHz) and strong attenuation at higher frequencies (above ca. 2 kHz), leading to a limited bandwidth [1]. The occlusion effect is determined by the ratio between the air-conducted and body-conducted components of own voice, which depends on device properties such as earmould fit and insertion depth [2], individual anatomic factors such as residual ear canal volume and shape [3, 4], and the generated sounds or phonemes [5, 6]. In particular, it has been shown that the occlusion effect for different vowels can be predicted by a linear combination of their formant frequencies [7], with closed front vowels exhibiting the largest occlusion effect. In addition, mouth movements during articulation [8] and body-conduction from different places of excitation [9] likely influence the occlusion effect as well. Unlike acoustical models based on ear canal geometry [3] or three-dimensional finite element models of body-conduction occlusion [10], in this paper we consider a signal processing-based approach to model the own voice transfer characteristics between a microphone at the entrance of the occluded ear canal (i.e., at the outer face of the hearable) and an in-ear microphone.

In many hearable applications, acoustic transfer path models for the microphone inside the occluded ear canal are required. For example, active noise cancellation algorithms may benefit from an accurate estimate of the so-called secondary path between the hearable loudspeaker and the in-ear microphone [11, 12]. In active occlusion cancellation (AOC), models of the own voice transfer path between the microphones inside and outside the occluded ear canal can be used to generate a cancellation signal that aims at compensating the occlusion effect as measured at the in-ear microphone [13, 14]. Models of the own voice transfer path are not only relevant for AOC, but also for algorithms to enhance the quality of the in-ear microphone signal picking up the own voice of the user. Several own voice reconstruction algorithms aiming at bandwidth extension, equalization and noise reduction have been proposed, e.g., based on classical signal processing [15] or supervised learning [16–19]. Supervised learning-based approaches typically require large amounts of training data. Since large amounts of realistic in-ear recordings may be hard to obtain for several talkers, an accurate and possibly individual model of the own voice transfer characteristics would be highly beneficial. Such a model would make it possible to generate large amounts of simulated in-ear signals, either from recordings at the entrance of the ear canal or from speech corpora, e.g., [20]. Data augmentation can then be performed with these simulated in-ear signals to train supervised learning-based own voice reconstruction algorithms. As for other acoustic signal processing applications [21–23], it is expected that using more accurate acoustic models for generating augmented training data improves system performance and generalization ability.

Several models of own voice transfer characteristics have been presented in the literature, either between two air-conduction microphones [17] or between an air-conduction and a body-conduction microphone [16, 18, 24]. In [24], it has been proposed to convert air-conducted to bone-conducted speech using a deep neural network (DNN) model that accounts for individual differences between talkers based on a speaker identification system. In [16], a DNN model estimating bone-conducted speech from air-conducted speech is jointly trained with a multi-modal enhancement network within a semi-supervised training scheme, resulting in reduced data requirements compared to fully supervised training. Instead of using rather complicated black-box DNN models, in [17, 18] time-invariant linear relative transfer functions (RTFs) are used to model own voice transfer characteristics. To introduce variations in the simulated own voice signals, either RTFs estimated on recordings of multiple talkers are used [17], or random values are added to the magnitude of the RTF estimated from a single talker [18]. It should be realized that these variations do not account for the speech-dependent nature of the own voice transfer characteristics.

Aiming at obtaining a model of the own voice transfer characteristics that generalizes well to unseen utterances and talkers, in this paper we propose a speech-dependent system identification approach, where for each phoneme a different RTF between the microphone at the entrance of the occluded ear canal and the in-ear microphone is estimated. We consider both individual and talker-averaged models. To simulate in-ear own voice signals from broadband speech, a phoneme recognition system is first utilized to segment the broadband speech into segments corresponding to specific phonemes, which are then filtered using the corresponding (smoothed) phoneme-specific RTFs. In contrast to previous RTF-based modeling approaches [17, 18], the proposed model of own voice transfer characteristics is speech-dependent and thus time-varying. In addition, contrary to the DNN-based modeling approach [16], only a small amount of own voice recordings is required for model estimation. The accuracy of simulating in-ear signals is assessed using recorded own voice signals of over 300 utterances by 18 talkers, each wearing a prototype hearable device [25]. The role of speech-dependency for simulating in-ear own voice signals is investigated by comparing the proposed speech-dependent RTF-based model to a speech-independent RTF-based model and an adaptive filtering-based model [26], which is utterance-specific. Experimental results show that the proposed speech-dependent model enables more accurate simulation of in-ear own voice signals than the speech-independent model and the adaptive filtering-based model in terms of technical distance measures. In addition, the performance of individual and talker-averaged models is compared in terms of their generalization capability to unseen talkers. Results show that the speech-dependent talker-averaged model generalizes better to utterances of unseen talkers than speech-independent or individual models. Preliminary results of the proposed approach have already been published in [27]. This paper extends the previous work presented in [27] by proposing talker-averaged models, by investigating utterance and talker mismatch separately, and by conducting experiments on a larger corpus of hearable recordings.

The paper is structured as follows. In Section 2, the own voice signal model is introduced. In Section 3, several system identification approaches to model own voice transfer characteristics using time-invariant or time-varying linear filters are presented. In Section 4, the performance of these models is evaluated using recorded own voice signals for different conditions. In Section 5, limitations, the relation to previous research and possible applications of the proposed models are discussed, and Section 6 concludes the paper.

2 Signal model

Figure 1 depicts a hearable device equipped with an in-ear microphone and a microphone at the entrance of the (partly) occluded ear canal. Quantities related to these microphones are denoted by the subscripts i and o, respectively. We assume that the hearable is worn by a person (referred to as talker) in a noiseless environment. In the time domain, $s_{i,a}[n]$ and $s_{o,a}[n]$ denote the own voice components of talker a at both microphones, where n denotes the discrete-time index. The in-ear microphone signal $m_i[n]$ consists of the own voice component and additive noise, i.e.,

$m_i[n] = s_{i,a}[n] + v_i[n], \qquad (1)$

where the noise component $v_i[n]$ consists of unavoidable body-produced noise (e.g., breathing sounds, heartbeats). Similarly, the microphone signal $m_o[n]$ at the entrance of the occluded ear canal can be written as

$m_o[n] = s_{o,a}[n] + v_o[n], \qquad (2)$

where $v_o[n]$ mainly consists of sensor noise. The sensor noise is assumed to be negligible compared to the own voice component in both microphone signals. The own voice components of talker a at the in-ear microphone and the microphone at the entrance of the occluded ear canal are assumed to be related by the own voice transfer characteristics $T_a\{\cdot\}$, i.e.,

$s_{i,a}[n] = T_a\{ s_{o,a}[n] \}. \qquad (3)$

Figure 1

The own voice signal model for a hearable with two microphones (outer face, in-ear).

Due to individual anatomical differences of the ear canal [4], these transfer characteristics depend on the talker. In addition, it has been shown that these transfer characteristics depend on the spoken sounds [5, 6] (see also Fig. 7).

In this paper, we assume that the own voice transfer characteristics $T_a\{\cdot\}$ can be modeled as a time-varying linear system, i.e.,

$s_{i,a}[n] = \mathbf{h}^T[n]\, \mathbf{q}\, s_{o,a}[n], \qquad (4)$

with

$\mathbf{q}\, s_{o,a}[n] = \left[ s_{o,a}[n],\ s_{o,a}[n-1],\ \ldots,\ s_{o,a}[n-N+1] \right]^T. \qquad (5)$

The vector $\mathbf{h}[n]$ denotes a time-varying finite impulse response (FIR) filter with N coefficients,

$\mathbf{h}[n] = \left[ h_0[n],\ h_1[n],\ \ldots,\ h_{N-1}[n] \right]^T, \qquad (6)$

with $\{\cdot\}^T$ the transpose operator, and the vector $\mathbf{q}$ is defined as [28]

$\mathbf{q} = \left[ 1,\ q^{-1},\ \ldots,\ q^{-(N-1)} \right]^T, \qquad (7)$

with $q^{-1}$ the delay operator, i.e., $q^{-1} s[n] = s[n-1]$. The filtering operation in (4) can be approximated in the short-time Fourier transform (STFT) domain as

$S_{i,a}(k, l) \approx H_a(k, l)\, S_{o,a}(k, l), \qquad (8)$

where $S_{i,a}(k, l)$ and $S_{o,a}(k, l)$ denote the STFT coefficients of the own voice components, k denotes the frequency bin index, l denotes the time frame index and $H_a(k, l)$ denotes the relative transfer function (RTF) between the microphone at the entrance of the occluded ear canal and the in-ear microphone. Different from (4), this approximation is only time-varying between STFT frames and not within a single STFT frame.1
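As a concrete illustration of the signal model, the following minimal Python sketch (using numpy; all function and variable names are illustrative and not taken from the paper) applies a time-varying FIR filter sample by sample as in (4). The STFT-domain approximation (8) replaces this sample-wise operation by a single complex multiplication per frequency bin and frame.

```python
import numpy as np

def time_varying_fir(s_o, h):
    """Apply a time-varying FIR filter as in (4).

    s_o : own voice signal at the entrance of the occluded ear canal, shape (L,)
    h   : filter coefficients, shape (L, N); row h[n] is the filter at time n
    """
    L, N = h.shape
    s_i = np.zeros(L)
    for n in range(L):
        # most recent input samples [s_o[n], s_o[n-1], ...], truncated at n = 0
        past = s_o[max(0, n - N + 1):n + 1][::-1]
        s_i[n] = h[n, :len(past)] @ past
    return s_i

# toy example: a filter that slowly morphs between two impulse responses
rng = np.random.default_rng(0)
s_o = rng.standard_normal(5000)
h_start = np.zeros(16)
h_start[0] = 1.0            # pure pass-through
h_end = np.zeros(16)
h_end[3] = 0.5              # attenuated and delayed
w = np.linspace(0.0, 1.0, 5000)[:, None]
h = (1.0 - w) * h_start + w * h_end
s_i = time_varying_fir(s_o, h)
```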

3 Modeling of own voice transfer characteristics

In this section, several methods are presented to model own voice transfer characteristics and subsequently simulate in-ear own voice signals. As outlined in Figure 2, in the system identification step the parameters θ of the model are estimated (either in the time domain or in the frequency domain) based on the signals recorded at the in-ear microphone and the microphone at the entrance of the occluded ear canal. In the simulation step, this model can then be used to generate simulated in-ear own voice signals from microphone signals at the entrance of the occluded ear canal, i.e.,

$\hat{s}_{i,b}[n] = \hat{T}_{\theta}\{ m_{o,b}[n] \}, \qquad (9)$

where $m_{o,b}[n]$ denotes the microphone signal of talker b at the entrance of the occluded ear canal and $\hat{s}_{i,b}[n]$ denotes the simulated in-ear own voice signal.

Figure 2

Overview of the system identification and simulation steps of the own voice transfer characteristic models.

Both individual models for a specific talker and talker-averaged models will be considered. In Section 4 it will be experimentally investigated whether talker-averaging increases robustness to talker mismatch. To estimate the individual model $\hat{\theta}_a$ for talker a, recorded microphone signals from talker a are used. This model can then be used to simulate in-ear signals either for the same talker a and the same recorded microphone signals (same talker, same utterance), for different utterances of talker a than used during system identification (utterance mismatch), or for utterances of another talker b (talker mismatch). To estimate the talker-averaged model $\hat{\theta}_{\mathrm{avg}}$, recorded microphone signals from several talkers are used.

Sections 3.1–3.3 consider RTF-based frequency-domain models for the own voice transfer characteristics. In Section 3.1, a speech-independent time-invariant model for a specific talker is presented, similar to [17]. In Section 3.2, a speech-dependent model for a specific talker is proposed, which accounts for the time-varying own voice transfer characteristics by assuming a different RTF for each phoneme. Section 3.3 describes how to compute talker-averaged speech-independent and speech-dependent models. Contrary to Sections 3.1–3.3, in Section 3.4 an adaptive filtering-based time-domain model of own voice transfer characteristics is presented, which is utterance-specific.

3.1 Speech-independent individual model

If own voice transfer characteristics are assumed to be speech-independent, the individual transfer characteristics of talker a can be modeled as a time-invariant RTF $H_a(k)$ between the microphone at the entrance of the occluded ear canal and the in-ear microphone:

$S_{i,a}(k, l) = H_a(k)\, S_{o,a}(k, l), \quad k \in \{0, \ldots, K-1\}, \qquad (10)$

where K denotes the STFT size. Assuming that the own voice component at the entrance of the occluded ear canal and the body-produced noise are independent, in the system identification step the RTF can be estimated using the well-known least squares approach [29], i.e.,

$\hat{H}_a(k) = \underset{H(k)}{\arg\min}\ \sum_{l} \left| M_{i,a}(k, l) - H(k)\, M_{o,a}(k, l) \right|^2, \qquad (11)$

considering all STFT frames of the recorded microphone signals $M_{i,a}(k, l)$ and $M_{o,a}(k, l)$ from talker a used for system identification. The least-squares RTF estimate is obtained as

$\hat{H}_a(k) = \dfrac{\sum_{l} M_{o,a}^{*}(k, l)\, M_{i,a}(k, l)}{\sum_{l} \left| M_{o,a}(k, l) \right|^{2}}, \qquad (12)$

where $\{\cdot\}^*$ denotes complex conjugation. In the simulation step, own voice speech of talker b recorded at the microphone at the entrance of the occluded ear canal is filtered in the STFT domain with the RTF estimate of talker a (where talker a and b can be the same or different), i.e.,

$\hat{S}_{i,b}(k, l) = \hat{H}_a(k)\, M_{o,b}(k, l). \qquad (13)$

After applying the inverse STFT, a weighted overlap-add (WOLA) scheme is employed to obtain the time-domain signal $\hat{s}_{i,b}[n]$. Figure 3 depicts the signal flow to simulate in-ear own voice signals for talker b using the speech-independent individual model for talker a.

Figure 3

Simulation of in-ear own voice signals for talker b using the speech-independent model for talker a.
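As an illustration of the speech-independent model, the following Python sketch implements the least-squares RTF estimate (12) and the frame-wise filtering (13) in the STFT domain (a minimal sketch assuming numpy and scipy; the STFT parameters follow Section 4.2, while the function names and the small regularization term in the denominator are illustrative assumptions, not taken from the paper).

```python
import numpy as np
from scipy.signal import stft, istft

K = 128                          # STFT size (see Sect. 4.2)
HOP = K // 2                     # 50% overlap
WIN = np.sqrt(np.hanning(K))     # square-root Hann analysis/synthesis window

def ls_rtf(m_o, m_i):
    """Least-squares RTF estimate (12) from recorded microphone signals."""
    _, _, M_o = stft(m_o, window=WIN, nperseg=K, noverlap=K - HOP)
    _, _, M_i = stft(m_i, window=WIN, nperseg=K, noverlap=K - HOP)
    num = np.sum(np.conj(M_o) * M_i, axis=1)        # sum over frames l
    den = np.sum(np.abs(M_o) ** 2, axis=1) + 1e-12  # small term avoids division by zero
    return num / den                                # shape (K // 2 + 1,)

def simulate_in_ear(m_o, H):
    """Simulate the in-ear own voice signal by frame-wise filtering (13) and WOLA."""
    _, _, M_o = stft(m_o, window=WIN, nperseg=K, noverlap=K - HOP)
    S_hat = H[:, None] * M_o
    _, s_hat = istft(S_hat, window=WIN, nperseg=K, noverlap=K - HOP)
    return s_hat
```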

3.2 Speech-dependent individual model

Since own voice transfer characteristics likely depend on speech content, we propose to model the transfer characteristics $T_a\{\cdot\}$ of talker a using a time-varying speech-dependent model. In the system identification step, first a frame-wise phoneme annotation $p(l) \in \{1, \ldots, P\}$ with P possible phoneme classes is obtained from the microphone signal at the entrance of the occluded ear canal using a phoneme recognition system $R\{\cdot\}$:

$p(l) = R\{ m_{o,a} \}(l). \qquad (14)$

Assuming that the transfer characteristics for each phoneme can be modeled using a (time-invariant) RTF, the RTF for phoneme p′ can be estimated from all frames where this phoneme is detected as

$\hat{H}_{a,p'}(k) = \dfrac{\sum_{l:\, p(l) = p'} M_{o,a}^{*}(k, l)\, M_{i,a}(k, l)}{\sum_{l:\, p(l) = p'} \left| M_{o,a}(k, l) \right|^{2}}. \qquad (15)$

Hence, the speech-dependent model for talker a consists of P RTFs:

$\hat{\theta}_a^{\mathrm{SD}} = \left\{ \hat{H}_{a,1}(k),\ \hat{H}_{a,2}(k),\ \ldots,\ \hat{H}_{a,P}(k) \right\}. \qquad (16)$

In the simulation step, first the phoneme sequence $p_b(l)$ is determined from the own voice speech of talker b recorded at the microphone at the entrance of the occluded ear canal. For each frame, the corresponding phoneme-specific RTF $\hat{H}_{a,p_b(l)}(k)$ is selected. In order to prevent discontinuities in the RTFs during phoneme transitions, recursive smoothing with smoothing constant α is applied, i.e.,

$\bar{H}_a(k, l) = \alpha\, \bar{H}_a(k, l-1) + (1 - \alpha)\, \hat{H}_{a, p_b(l)}(k). \qquad (17)$

The smoothed RTF $\bar{H}_a(k, l)$ is then used to simulate the own voice of talker b at the in-ear microphone:

$\hat{S}_{i,b}(k, l) = \bar{H}_a(k, l)\, M_{o,b}(k, l). \qquad (18)$

Similarly to the speech-independent model, a WOLA scheme is employed to obtain the time-domain signal $\hat{s}_{i,b}[n]$. Figure 4 depicts the signal flow to simulate in-ear own voice signals for talker b using the speech-dependent model for talker a. Due to the phoneme recognition system for frame-wise phoneme-specific RTF selection, we expect that the proposed speech-dependent model is able to simulate in-ear signals more accurately than the speech-independent model, also for utterances not used during system identification. In addition, it should be realized that unlike the speech-independent model, the speech-dependent model also accounts for speech pauses by modeling them as a separate phoneme.

Figure 4

Simulation of in-ear own voice signals for talker b using the proposed speech-dependent model for talker a.
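A sketch of the speech-dependent identification (15) and simulation (17)-(18) steps is given below (Python with numpy/scipy; the frame-wise phoneme labels are assumed to be provided by an external phoneme recognizer with one label per STFT frame, and the all-ones fallback for unobserved phonemes is an illustrative assumption, not taken from the paper).

```python
import numpy as np
from scipy.signal import stft, istft

K, HOP = 128, 64
WIN = np.sqrt(np.hanning(K))

def phoneme_rtfs(m_o, m_i, p, P):
    """Phoneme-specific least-squares RTFs (15).

    p : frame-wise phoneme labels in {0, ..., P-1}, one label per STFT frame
    Returns an array of shape (P, K // 2 + 1); phonemes never observed keep
    an all-ones RTF as a neutral fallback.
    """
    p = np.asarray(p)
    _, _, M_o = stft(m_o, window=WIN, nperseg=K, noverlap=K - HOP)
    _, _, M_i = stft(m_i, window=WIN, nperseg=K, noverlap=K - HOP)
    H = np.ones((P, K // 2 + 1), dtype=complex)
    for ph in range(P):
        idx = np.where(p == ph)[0]
        if idx.size:
            num = np.sum(np.conj(M_o[:, idx]) * M_i[:, idx], axis=1)
            den = np.sum(np.abs(M_o[:, idx]) ** 2, axis=1) + 1e-12
            H[ph] = num / den
    return H

def simulate_speech_dependent(m_o, p, H, alpha=0.8):
    """Frame-wise RTF selection with recursive smoothing (17) and filtering (18)."""
    _, _, M_o = stft(m_o, window=WIN, nperseg=K, noverlap=K - HOP)
    S_hat = np.zeros_like(M_o)
    H_bar = H[p[0]].copy()
    for l in range(M_o.shape[1]):
        H_bar = alpha * H_bar + (1 - alpha) * H[p[l]]
        S_hat[:, l] = H_bar * M_o[:, l]
    _, s_hat = istft(S_hat, window=WIN, nperseg=K, noverlap=K - HOP)
    return s_hat
```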

3.3 Talker-averaged models

Since individual models may not generalize well to different talkers, we also consider talker-averaged speech-independent and speech-dependent models. In the system identification step, talker-averaged models are obtained by considering all STFT frames of the recorded microphone signals of all utterances from all talkers except talker b (leave-one-out paradigm) for system identification. The RTFs of the speech-independent talker-averaged model are hence computed as

$\hat{H}_{\mathrm{avg}}(k) = \dfrac{\sum_{a \neq b} \sum_{l} M_{o,a}^{*}(k, l)\, M_{i,a}(k, l)}{\sum_{a \neq b} \sum_{l} \left| M_{o,a}(k, l) \right|^{2}}, \qquad (19)$

while the RTFs of the speech-dependent talker-averaged model for phoneme p′ are computed as

$\hat{H}_{\mathrm{avg},p'}(k) = \dfrac{\sum_{a \neq b} \sum_{l:\, p_a(l) = p'} M_{o,a}^{*}(k, l)\, M_{i,a}(k, l)}{\sum_{a \neq b} \sum_{l:\, p_a(l) = p'} \left| M_{o,a}(k, l) \right|^{2}}. \qquad (20)$

The simulation step for the talker-averaged models is similar to that for the individual models, where for the speech-independent model $\hat{H}_{\mathrm{avg}}(k)$ is used instead of $\hat{H}_a(k)$, and for the speech-dependent model $\hat{H}_{\mathrm{avg},p'}(k)$ is used instead of $\hat{H}_{a,p'}(k)$.
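Under the leave-one-out paradigm, the talker-averaged RTFs (19) and (20) simply pool the least-squares sums of (12) and (15) over all identification talkers; a minimal sketch (Python/numpy, illustrative function name):

```python
import numpy as np

def talker_averaged_rtf(stft_pairs):
    """Pooled least-squares RTF over several talkers, as in (19).

    stft_pairs : iterable of (M_o, M_i) STFT matrices of shape (K // 2 + 1, L),
                 one pair per identification utterance/talker (target talker left out).
    For the speech-dependent variant (20), restrict the frame (column) index of
    each pair to the frames labeled with phoneme p' before calling this function.
    """
    num, den = 0.0, 0.0
    for M_o, M_i in stft_pairs:
        num = num + np.sum(np.conj(M_o) * M_i, axis=1)
        den = den + np.sum(np.abs(M_o) ** 2, axis=1)
    return num / (den + 1e-12)
```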

3.4 Adaptive filtering-based model

As an alternative to the time-varying speech-dependent model in Section 3.2, in this section we consider a time-domain adaptive filter to model the time-varying transfer path between the microphone at the entrance of the occluded ear canal and the in-ear microphone. The signal flow is illustrated in Figure 5. In the system identification step, the FIR filter $\hat{\mathbf{h}}[n]$ with N coefficients is adapted based on recorded microphone signals of an utterance of talker a. The adaptive filter aims at minimizing the error between the in-ear microphone signal and the estimated in-ear own voice signal,

$e[n] = m_{i,a}[n] - \hat{s}_{i,a}[n], \qquad (21)$

with

$\hat{s}_{i,a}[n] = \hat{\mathbf{h}}^T[n]\, \mathbf{q}\, m_{o,a}[n]. \qquad (22)$

Figure 5

The adaptive filtering scheme utilized for estimating in-ear speech signals. The filter coefficients are transferred from system identification to simulation directly after each sample-wise adaptation step.

For adapting the filter, the well-known normalized least mean squares (NLMS) algorithm is used [26], i.e., the filter coefficients are recursively updated as

$\hat{\mathbf{h}}[n+1] = \hat{\mathbf{h}}[n] + \mu\, \dfrac{e[n]\; \mathbf{q}\, m_{o,a}[n]}{\left( \mathbf{q}\, m_{o,a}[n] \right)^T \left( \mathbf{q}\, m_{o,a}[n] \right) + \varepsilon}, \qquad (23)$

where μ denotes the step size and ε is a small regularization constant. The model parameters of the adaptive filtering-based model are

$\hat{\theta}_a^{\mathrm{AD}} = \left\{ \hat{\mathbf{h}}[n] \right\}_{n}, \qquad (24)$

i.e., the entire sequence of sample-wise adapted filter coefficients.

Since this model implicitly depends on a specific utterance, it is not possible to obtain a talker-averaged model by following a procedure similar to the one described in the previous section.

In the simulation step, the simulated in-ear own voice signal of talker b is computed as

$\hat{s}_{i,b}[n] = \hat{\mathbf{h}}^T[n]\, \mathbf{q}\, m_{o,b}[n]. \qquad (25)$

In case of utterance mismatch (both for the same talker and for a different talker), the filter is applied to a different input signal than used during adaptation which likely results in estimation errors.
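The NLMS identification (21)-(23) and the re-application of the stored coefficient trajectory (25) can be sketched as follows (Python/numpy; the function names and the dense storage of one coefficient vector per sample are illustrative assumptions, not taken from the paper).

```python
import numpy as np

def nlms_identify(m_o, m_i, N=128, mu=0.5, eps=1e-6):
    """Sample-wise NLMS adaptation (23); returns the filter after every step."""
    h = np.zeros(N)
    H = np.zeros((len(m_o), N))      # coefficient trajectory, one row per sample
    x = np.zeros(N)                  # buffer [m_o[n], m_o[n-1], ..., m_o[n-N+1]]
    for n in range(len(m_o)):
        x = np.roll(x, 1)
        x[0] = m_o[n]
        e = m_i[n] - h @ x           # a priori error (21) with filter output (22)
        h = h + mu * e * x / (x @ x + eps)
        H[n] = h                     # coefficients after this adaptation step
    return H

def nlms_simulate(m_o_new, H):
    """Apply the stored coefficient trajectory to a (possibly different) input (25)."""
    N = H.shape[1]
    x = np.zeros(N)
    s_hat = np.zeros(len(m_o_new))
    for n in range(min(len(m_o_new), H.shape[0])):
        x = np.roll(x, 1)
        x[0] = m_o_new[n]
        s_hat[n] = H[n] @ x
    return s_hat
```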

4 Experimental evaluation

In this section, the own voice transfer characteristic models discussed in Section 3 are evaluated in terms of their accuracy in simulating in-ear own voice signals for different conditions. In Section 4.1, the data used in the evaluation and the experimental conditions are described. In Section 4.2, the simulation parameters are defined. In Section 4.3, examples of simulated in-ear own voice signals and estimated RTFs are presented for all considered RTF-based models. In Sections 4.4–4.6, experimental results are presented and discussed for three conditions: matched condition (same talker, same utterance), utterance mismatch and talker mismatch.

4.1 Recording setup and experimental conditions

For identifying and evaluating the own voice transfer characteristic models, we recorded a dataset of own voice speech from 18 native German talkers (5 female, 13 male), with approximately 25–30 min of recorded own voice signals per talker. The hearable device used for recording is the closed-vent variant of the one-size-fits-all Hearpiece [25]. The concha microphone of the Hearpiece was selected as the microphone at the entrance of the occluded ear canal (i.e., at the outer face of the hearable). Talkers were excluded if insertion of the hearable was not possible, or if a bad fit with insufficient attenuation of external sounds was detected (by measuring the transfer function of an external loudspeaker between the concha microphone and the in-ear microphone). For each talker, 306 pre-determined sentences were recorded: the Marburg and Berlin sentences [30], each consisting of 100 sentences, 100 common everyday German sentences for language learners [31], and the German version of the well-known text The North Wind and the Sun, consisting of 6 sentences. Recordings were conducted in a sound-proof listening booth using a Behringer UMC1820 audio interface. Before the recordings started, informed consent was obtained from all talkers. The recorded dataset is publicly available on Zenodo [32]. During system identification, model parameters were estimated on 150 sentences uttered by each talker. During simulation, in-ear own voice signals were generated from the recorded microphone signals at the outer face of the Hearpiece and evaluated per utterance.

Three different simulation conditions are investigated:

4.1.1 Same talker, same utterance (matched condition)

In this condition, the individual RTF-based models and the adaptive filtering-based model are evaluated on exactly the same utterances of the same talker (a = b) as considered during model estimation. For the adaptive filtering-based model, this means that the same signal is used during simulation as during identification (see Fig. 5), such that the simulated in-ear signal is equal to the output $\hat{s}_{i,a}[n]$ of the adaptive filter. Talker-averaged models are not considered in this condition.

4.1.2 Same talker, utterance mismatch

In this condition, the individual RTF-based models and the adaptive filtering-based model are evaluated on speech of the same talker (a = b) as considered during model estimation. In order to investigate the generalization ability of the models for the same talker, evaluation is performed on the 156 sentences not used to estimate the models. For the adaptive filtering-based model, the length of the signals used during simulation and identification is matched, either by cutting or concatenating the signals used during model estimation with other signals from the same talker. Talker-averaged models are not considered in this condition.

4.1.3 Talker mismatch

The generalization ability of the models to unseen talkers is investigated by estimating speech of talker b using models estimated on a different talker (a ≠ b). For each utterance, a random talker a is assigned to talker b. In this condition, there is also an implicit utterance mismatch, because the same sentence uttered by different talkers most likely differs with respect to speed, frequency content, pronunciation and other speech attributes. Talker-averaged models are considered in this condition only. For each talker b, a talker-averaged model is computed from utterances of the remaining 17 talkers. Evaluation is performed on the 156 sentences not used to estimate the models.

In all three conditions, the Log-Spectral Distance (LSD) [33] and the Mel-Cepstral Distance (MCD) [34] between the recorded in-ear signals and the simulated in-ear signals are used as evaluation metrics. For both metrics, a lower value indicates a more accurate estimate. Since perceptual metrics such as the Perceptual Evaluation of Speech Quality (PESQ) [35] were found not to correlate well with subjective ratings of body-conducted own voice signals [36], such metrics are not considered in this study.
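For reference, one common way of computing the log-spectral distance between a recorded and a simulated in-ear signal is sketched below (Python with numpy/scipy; the exact definition used in the paper follows [33] and may differ in windowing and normalization details, so this sketch is an assumption rather than the authors' implementation).

```python
import numpy as np
from scipy.signal import stft

def log_spectral_distance(s_ref, s_est, fs=5000, K=128, eps=1e-10):
    """Log-spectral distance in dB: RMS over frequency of the log-magnitude
    difference per frame, averaged over frames."""
    _, _, S_ref = stft(s_ref, fs=fs, nperseg=K, noverlap=K // 2)
    _, _, S_est = stft(s_est, fs=fs, nperseg=K, noverlap=K // 2)
    L = min(S_ref.shape[1], S_est.shape[1])
    diff = 20 * np.log10((np.abs(S_ref[:, :L]) + eps) / (np.abs(S_est[:, :L]) + eps))
    return float(np.mean(np.sqrt(np.mean(diff ** 2, axis=0))))
```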

4.2 Simulation parameters

The experiments were carried out at a sampling frequency of 5 kHz, since above 2.5 kHz the in-ear microphone signals hardly contain any body-conducted speech for the considered hearable device. Model-specific parameters were set empirically based on preliminary experiments. For the RTF-based models, an STFT framework with a frame length of K = 128 (corresponding to 25.6 ms) and an overlap of 50% was used, where a square-root Hann window was utilized both as analysis and synthesis window. For the speech-dependent models, a smoothing parameter of α = 0.8 was used in (17), corresponding to an effective smoothing time of 64 ms. The phoneme recognition system used was trained on German speech with P = 62 phoneme classes. For the adaptive filtering-based model, the filter length was set to N = 128, and a step size μ = 0.5 and regularization constant ε = 10⁻⁶ were used in (23). The filter coefficients were initialized as zeros. For all methods, no voice activity detection was employed, so that utterances may contain short pauses.
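One way to read these parameter choices (an assumption about how the 64 ms figure relates to α, not stated explicitly in the paper) is that the effective smoothing time equals the frame shift divided by (1 − α):

```python
fs = 5000                   # sampling frequency in Hz
K = 128                     # STFT frame length -> K / fs = 25.6 ms
hop = K // 2                # 50% overlap -> 12.8 ms frame shift
alpha = 0.8                 # recursive smoothing constant in (17)

frame_shift_ms = 1000 * hop / fs              # 12.8 ms
smoothing_ms = frame_shift_ms / (1 - alpha)   # 12.8 / 0.2 = 64 ms
print(frame_shift_ms, smoothing_ms)
```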

4.3 Example spectrograms and RTFs

For the RTF-based models, this section presents examples of simulated in-ear own voice signals, spectrograms and estimated RTFs. For the matched condition (same talker, same utterance), Figure 6 shows example spectrograms for a specific utterance (the beginning of The North Wind and the Sun) of talker 2 (male). Shown are the spectrograms of the microphone signal at the entrance of the occluded ear canal and the in-ear microphone signal, as well as of the in-ear own voice signals simulated with the speech-independent models and the proposed speech-dependent models (individual and talker-averaged).2 While it can be observed that the speech-independent models estimate the in-ear microphone signal rather well in the frequency region below 500 Hz, they clearly underestimate own voice components at higher frequencies. On the other hand, the speech-dependent models are able to estimate the in-ear microphone signal more accurately at higher frequencies, although deviations are visible above 1 kHz. For this example, the estimates of the individual and talker-averaged models are very similar for both the speech-independent and speech-dependent models. It should be noted that the low-frequency body-produced noise in the in-ear microphone signal is not present in any of the simulated in-ear own voice signals. For the same utterance as in Figure 6, Figure 7 depicts the time-domain own voice signal recorded at the entrance of the occluded ear canal with its phoneme annotation, and the magnitude of the phoneme-specific individual RTFs, estimated using (15). Different from the other experiments, these RTFs were estimated with a sampling frequency of 16 kHz and an STFT size of K = 256 in order to also show the high-frequency region. It can be seen that the RTFs of different phonemes differ considerably in the low-frequency region below 2.5 kHz, while above 2.5 kHz the RTFs are very similar.

Figure 6

Example spectrograms for the same talker, same utterance condition: recorded own voice signal of talker 2 at the entrance of the occluded ear canal (top left) and recorded in-ear own voice signal (top right) of talker 2, and the simulated in-ear own voice signals estimated by the speech-independent individual (middle left) and speech-independent talker-averaged (middle right), and the speech-dependent individual (bottom left) and speech-dependent talker-averaged (bottom right) models.

Figure 7

Example own voice signal of talker 2 recorded at the entrance of the occluded ear canal with phoneme annotation (top) and magnitude of phoneme-specific individual relative transfer functions (bottom) estimated on all utterances of this talker (speech-dependent individual model). Only RTF magnitudes of phonemes appearing in the depicted utterance are shown.

To compare the RTF-based models, Figure 8 depicts the estimated RTF magnitudes for the speech-independent models (top subplot) and the speech-dependent models for two selected phonemes (middle and bottom subplots), considering all talkers in the experiments. The individual RTFs are represented by shaded regions and the talker-averaged RTFs as solid lines. Different from the talker-averaged RTFs used in the talker mismatch condition (leave-one-out paradigm), the averages here are computed over all 18 talkers. For the speech-independent RTFs, it can be observed that for most talkers the low-frequency region below approximately 600 Hz is amplified at the in-ear microphone relative to the microphone at the entrance of the occluded ear canal, whereas the frequency region above approximately 1.5 kHz is attenuated. While half of the estimated RTFs (i.e., between the quartiles Q1 and Q3) are very similar in magnitude, for some talkers there appear to be larger deviations from the talker-averaged RTF magnitude. For the phoneme-specific RTFs shown in the middle and bottom subplots, similar tendencies in terms of inter-individual variance can be observed. However, the phoneme-specific talker-averaged RTFs differ from the speech-independent talker-averaged RTF. In particular, for the phoneme /ʒ/ the magnitude is considerably higher than the magnitude of the speech-independent talker-averaged RTF in the frequency regions between 500 Hz and 1.5 kHz and above 2 kHz for the majority of talkers. In contrast, for the phoneme /o/ the RTF magnitudes are lower than the magnitude of the speech-independent talker-averaged RTF, especially in the low-frequency region.

Figure 8

Relative transfer functions estimated for the speech-independent individual and talker-averaged models (top) and for two phonemes with the speech-dependent models (middle and bottom). Values between the quartiles Q1 and Q3 and between the minimum and maximum values of the individual models are indicated by shaded regions. Talker-averaged relative transfer functions over all talkers are shown as solid black lines.

4.4 Same talker, same utterance

For the matched condition (same talker, same utterance), Figure 9 shows the LSD and MCD scores between the recorded in-ear signals and the simulated in-ear signals for the speech-independent and speech-dependent individual RTF-based models and the adaptive filtering-based model. It can be observed that both metrics are much lower for the speech-dependent individual model and the adaptive filtering-based model than for the speech-independent individual model. These results demonstrate that in-ear own voice signals can be simulated more accurately when time-varying or speech-dependent transfer characteristics are accounted for. In addition, the speech-dependent individual model performs nearly as well as the adaptive filtering-based model, where it should be realized that for the matched condition the (utterance-specific) adaptive filter can be considered as the optimal time-varying filter. This indicates that the proposed phoneme-specific RTF-based model is able to accurately model time-varying behavior of own voice transfer characteristics. It can be noted that even in the matched condition, none of the considered methods is able to perfectly simulate the recorded in-ear own voice signals. This can be explained by the fact that the considered methods are not able to account for body-produced noise (see Fig. 6) and possible non-linear effects, which are however assumed to be small.

Figure 9

Results for the same talker, same utterance condition with speech-independent (SI), speech-dependent (SD) and adaptive filtering-based (AD) models. Note that the y-axis limits of both subfigures are different. (a) Log-spectral distance; (b) Mel-cepstral distance.

4.5 Same talker, utterance mismatch

For the same models as in the previous section, Figure 10 shows the LSD and MCD scores for the utterance mismatch condition (same talker, utterance mismatch). For the speech-dependent and speech-independent individual models, the results are very similar to those in the matched condition (see Fig. 9), indicating that both models generalize well to other utterances of the same talker. For the adaptive filtering-based model, on the other hand, the LSD and MCD scores are much larger than in the matched condition, showing that the utterance-specific adaptive filtering-based method (expectedly) does not generalize well to other utterances.

Figure 10

Results for the same talker, utterance mismatch condition with speech-independent (SI), speech-dependent (SD) and adaptive filtering-based (AD) models. Note that the y-axis limits of both subfigures are different. (a) Log-spectral distance; (b) Mel-cepstral distance.

4.6 Talker mismatch

For the talker mismatch condition, Figure 11 shows the LSD and MCD scores for the speech-independent and speech-dependent models (both individual and talker-averaged) and the adaptive filtering-based model. It can be clearly observed that the speech-dependent models outperform the speech-independent models and the adaptive filtering-based model, where the best performance in terms of both metrics is achieved by the speech-dependent talker-averaged model. This indicates that the speech-dependent talker-averaged model has the best generalization ability to unseen talkers. Comparing the results in Figures 10 and 11, it can be observed that the LSD and MCD scores of the speech-dependent individual model are larger under talker mismatch. Especially the large variance of the MCD score is noticeable. Since this effect does not occur in the other conditions, it is likely a consequence of talker mismatch.

Figure 11

Results for the talker mismatch condition with speech-independent (SI), speech-dependent (SD) and adaptive filtering-based (AD) models, using individual and talker-averaged (avg.) versions. Note that the y-axis limits of both subfigures are different. (a) Log-spectral distance; (b) Mel-cepstral distance.

5 Discussion

The experiments in Section 4 investigated models of own voice transfer characteristics for simulating in-ear own voice signals. While adaptive filters cannot be used in practice since they are utterance-specific, the proposed speech-dependent RTF-based models are able to generalize to unseen utterances. In case of talker mismatch, the speech-dependent talker-averaged model was more robust than the speech-dependent individual model.

5.1 Limitations

It needs to be realized that the proposed speech-dependent models exhibit several limitations, mainly related to the phoneme recognition system and the recording setup: First, since the considered phoneme recognition system has been trained with German speech only, the speech-dependent models may not generalize well to other languages, or may require a phoneme recognition system matching these languages. Second, the phoneme recognition system, which is based on a speech recognition system, computes its phoneme annotation only when entire words are recognized. This leads to a variable processing delay, typically in the range of several hundred milliseconds to one second. With this phoneme recognition system, the proposed models cannot be used for real-time, low-latency applications, so that a different phoneme recognition system with a lower processing delay may be better suited for these applications. Third, the models are limited to the specific device used to obtain the recorded signals for model estimation. Applying the models to simulate in-ear own voice signals for other devices (e.g., over-ear headphones) would require estimation of RTFs from own voice signals recorded with those devices. Finally, the phoneme-dependent RTFs in the proposed models are estimated for discrete phonemes, and phoneme transitions are handled by temporal smoothing (see Sect. 3.2). However, this approximation may not accurately reflect the actual mouth movements that occur between uttering two phonemes.

5.2 Comparison to previous research

While previous research has addressed simulating in-ear own voice signals, the influence of speech-dependent changes has not been investigated specifically. Earlier studies either focus on speech-independent or black-box DNN models, or are not concerned with simulating in-ear own voice signals. In [7], occlusion effect level differences were modeled for several phonemes using a linear regression model that relates the phoneme formant frequencies to the amount of occlusion in the relevant frequency region below 500 Hz. However, the model in [7] does not allow for the simulation of new in-ear own voice signals. In [24], a DNN model was proposed to convert air-conducted to bone-conducted speech, accounting for individual differences between talkers based on a speaker identification system. While the model was able to generalize to different talkers than those used during training, the role of speech-dependent changes was not investigated. Recently, several DNN-based approaches have been proposed for own voice reconstruction (i.e., reconstruction of own voice speech from hearable microphones) either only using an in-ear microphone or a body-conduction sensor without considering environmental noise [17, 18, 38], or using both a body-conduction sensor and a microphone at the outer face of a hearable while considering environmental noise [39]. To simulate own voice signals for training, these approaches introduce random variations, either by using several RTFs per talker [17, 38] or adding random values to the RTFs [18, 39]. However, the accuracy of these approaches for simulating in-ear own voice signals has not been investigated, and speech-dependent changes were not accounted for in the simulation.

5.3 Applications

Due to their robustness to utterance and talker mismatch, the proposed speech-dependent models may be used, e.g., to simulate in-ear own voice signals as training data for DNN-based algorithms aiming at joint bandwidth extension, equalization, and noise reduction of own voice signals recorded at an in-ear or body-conduction microphone. In this application, a large amount of own voice signals is typically required to train DNNs. The proposed models may be beneficial for these applications, as they may be used to simulate in-ear own voice signals from broadband speech signals. Speech-independent models have already been used for this purpose in [17, 38, 39]. Since in-ear or body-conduction microphones are also beneficial for speech recognition systems (see e.g., [40]), the proposed speech-dependent models could be applied to training an own voice speech recognition system by simulating training data.

6 Conclusion

In this paper, speech-dependent models of own voice transfer characteristics in hearables have been proposed. The models can be utilized to estimate own voice signals at an in-ear microphone. In particular, the proposed models take into account time-varying speech-dependent behavior and inter-individual differences between talkers. To estimate in-ear own voice signals from broadband speech using the proposed speech-dependent models, phoneme-specific RTFs are used. The influence of utterance and talker mismatch on the estimation accuracy of in-ear own voice signals has been investigated in an experimental evaluation. Results show that using a speech-dependent model is beneficial compared to using a speech-independent model. Although the adaptive filtering-based approach is able to model the speech-dependency of the own voice transfer characteristics well in the matched condition, it completely fails when considering utterance and talker mismatch. However, the proposed individual speech-dependent models are able to generalize to different utterances of the same talker. Talker-averaged models were shown to generalize better to different talkers than individual models. Future work will investigate the usage of the proposed models for simulating in-ear signals to train own voice reconstruction algorithms based on supervised learning.

Acknowledgments

The Oldenburg Branch for Hearing, Speech and Audio Technology HSA is funded in the program Vorab by the Lower Saxony Ministry of Science and Culture (MWK) and the Volkswagen Foundation for its further development. This work was partly funded by the German Ministry of Science and Education BMBF FK 16SV8811 and the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – Project ID 352015383 – SFB 1330 C1. The authors wish to thank the talkers for their participation in the recordings.

Conflicts of interest

The authors declare no conflict of interest.

Data availability statement

The research data associated with this article are available in Zenodo, under the reference [32]. Supplemental material (listening examples) is available in Zenodo, under the reference [37].

Data privacy management

All subjects who participated in the recordings were informed about data collection and future data use, and gave informed consent.


1

Circular convolution effects are also neglected in this approximation, but can be reduced by appropriate windowing.

2

Audio examples corresponding to the spectrograms are available online at https://doi.org/10.5281/zenodo.11371976 [37].

References

1. R.E. Bouserhal, A. Bernier, J. Voix: An in-ear speech database in varying conditions of the audio-phonation loop, Journal of the Acoustical Society of America 145, 2 (2019) 1069–1077.
2. M.Ø. Hansen: Occlusion effects part I and II, PhD thesis, Department of Acoustic Technology, Technical University of Denmark, 1998.
3. S. Stenfelt, S. Reinfeldt: A model of the occlusion effect with bone-conducted stimulation, International Journal of Audiology 46, 10 (2007) 595–608.
4. S. Vogl, M. Blau: Individualized prediction of the sound pressure at the eardrum for an earpiece with integrated receivers and microphones, Journal of the Acoustical Society of America 145, 2 (2019) 917–930.
5. S. Reinfeldt, P. Östli, B. Håkansson, S. Stenfelt: Hearing one’s own voice during phoneme vocalization – transmission by air and bone conduction, Journal of the Acoustical Society of America 128, 2 (2010) 751–762.
6. H. Saint-Gaudens, H. Nélisse, F. Sgard, O. Doutres: Towards a practical methodology for assessment of the objective occlusion effect induced by earplugs, Journal of the Acoustical Society of America 151, 6 (2022) 4086–4100.
7. T. Zurbrügg, A. Stirnemann, M. Kuster, H. Lissek: Investigations on the physical factors influencing the ear canal occlusion effect caused by hearing aids, Acta Acustica united with Acustica 100, 3 (2014) 527–536.
8. J. Richard, V. Zimpfer, S. Roth: Effect of bone-conduction microphone location and mouth opening on transfer function between oral cavity sound pressure and skin acceleration, in: Proceedings of Convention of the European Acoustics Association (Forum Acusticum), Turin, Italy, 11–15 September, 2023, pp. 4725–4732.
9. C. Pörschmann: Influences of bone conduction and air conduction on the sound of one’s own voice, Acta Acustica united with Acustica 86, 6 (2000) 1038–1045.
10. M.K. Brummund, F. Sgard, Y. Petit, F. Laville: Three-dimensional finite element modeling of the human external ear: simulation study of the bone conduction occlusion effect, Journal of the Acoustical Society of America 135, 3 (2014) 1433–1444.
11. S. Liebich, J. Fabry, P. Jax, P. Vary: Signal processing challenges for active noise cancellation headphones, in: Proceedings of 13th ITG-Symposium on Speech Communication, Oldenburg, Germany, 10–12 October, 2018, VDE, pp. 11–15.
12. P. Rivera Benois, R. Roden, M. Blau, S. Doclo: Optimization of a fixed virtual sensing feedback ANC controller for in-ear headphones with multiple loudspeakers, in: Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May, 2022, IEEE, pp. 8717–8721.
13. T. Zurbrügg: The occlusion effect – measurements, simulations and countermeasures, in: Proceedings of 13th ITG-Symposium on Speech Communication, Oldenburg, Germany, 10–12 October, 2018, VDE, pp. 26–30.
14. S. Liebich, P. Vary: Occlusion effect cancellation in headphones and hearing devices – the sister of active noise cancellation, IEEE/ACM Transactions on Audio, Speech, and Language Processing 30 (2022) 35–48.
15. R.E. Bouserhal, T.H. Falk, J. Voix: In-ear microphone speech quality enhancement via adaptive filtering and artificial bandwidth extension, Journal of the Acoustical Society of America 141, 3 (2017) 1321–1331.
16. H. Wang, X. Zhang, D. Wang: Fusing bone-conduction and air-conduction sensors for complex-domain speech enhancement, IEEE/ACM Transactions on Audio, Speech, and Language Processing 30 (2022) 3134–3143.
17. M. Ohlenbusch, C. Rollwage, S. Doclo: Training strategies for own voice reconstruction in hearing protection devices using an in-ear microphone, in: Proceedings of International Workshop on Acoustic Signal Enhancement (IWAENC), Bamberg, Germany, 05–08 September, 2022, IEEE.
18. J. Hauret, T. Joubaud, V. Zimpfer, É. Bavu: Configurable EBEN: extreme bandwidth extension network to enhance body-conducted speech capture, IEEE/ACM Transactions on Audio, Speech, and Language Processing 31 (2023) 3499–3512.
19. M. Ohlenbusch, C. Rollwage, S. Doclo: Multi-microphone noise data augmentation for DNN-based own voice reconstruction for hearables in noisy environments, in: Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, South Korea, 14–19 April, 2024, IEEE, pp. 416–420.
20. V. Panayotov, G. Chen, D. Povey, S. Khudanpur: Librispeech: an ASR corpus based on public domain audio books, in: Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia, 19–24 April, 2015, IEEE, pp. 5206–5210.
21. T. Ko, V. Peddinti, D. Povey, M.L. Seltzer, S. Khudanpur: A study on data augmentation of reverberant speech for robust speech recognition, in: Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 05–09 March, 2017, IEEE, pp. 5220–5224.
22. W. He, P. Motlicek, J.-M. Odobez: Neural network adaptation and data augmentation for multi-speaker direction-of-arrival estimation, IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021) 1303–1317.
23. P. Srivastava, A. Deleforge, E. Vincent: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators, in: Proceedings of International Workshop on Acoustic Signal Enhancement (IWAENC), Bamberg, Germany, 05–08 September, 2022, IEEE.
24. M. Pucher, T. Woltron: Conversion of airborne to bone-conducted speech with deep neural networks, in: Proceedings of Interspeech, Brno, Czechia, August, 2021, pp. 1–5.
25. F. Denk, M. Lettau, H. Schepker, S. Doclo, R. Roden, M. Blau, J.-H. Bach, J. Wellmann, B. Kollmeier: A one-size-fits-all earpiece with multiple microphones and drivers for hearing device research, in: Proceedings of AES International Conference on Headphone Technology, San Francisco, USA, 27–29 August, 2019, AES.
26. S. Haykin: Adaptive filter theory, 3rd edn., Prentice Hall, 1996.
27. M. Ohlenbusch, C. Rollwage, S. Doclo: Speech-dependent modeling of own voice transfer characteristics for in-ear microphones in hearables, in: Proceedings of Convention of the European Acoustics Association (Forum Acusticum), Turin, Italy, 11–15 September, 2023, pp. 1899–1902.
28. L. Ljung: System identification, in: A. Procházka, J. Uhlíř, P.W.J. Rayner, N.G. Kingsbury (Eds.), Signal analysis and prediction: applied and numerical harmonic analysis, Springer, 1998, pp. 163–173.
29. Y. Avargel, I. Cohen: On multiplicative transfer function approximation in the short-time Fourier transform domain, IEEE Signal Processing Letters 14, 5 (2007) 337–340.
30. A.P. Simpson, K.J. Kohler, T. Rettstadt: The Kiel corpus of read/spontaneous speech: acoustic data base, processing tools, and analysis results, Arbeitsberichte Institut für Phonetik und Digitale Sprachverarbeitung Universität Kiel 32 (1997) 243–247.
31. A. Neustein: 100 Sätze reichen für ein ganzes Leben (Blog post), August, 2019. Available at https://deutschlernerblog.de/100-saetze-reichen-fuer-ein-ganzes-leben/.
32. M. Ohlenbusch, C. Rollwage, S. Doclo: German own voice recordings with hearable microphones, Zenodo, 2024. https://doi.org/10.5281/zenodo.10844598.
33. A. Gray, J. Markel: Distance measures for speech processing, IEEE Transactions on Acoustics, Speech, and Signal Processing 24, 5 (1976) 380–391.
34. R.F. Kubichek: Mel-cepstral distance measure for objective speech quality assessment, in: Proceedings of IEEE Pacific Rim Conference on Communications Computers and Signal Processing, Victoria, BC, Canada, 19–21 May, 1993, IEEE, pp. 125–128.
35. International Telecommunications Union (ITU): ITU-T P.862, Perceptual Evaluation of Speech Quality (PESQ): an objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs, International Telecommunications Union, 2001. Available at https://www.itu.int/rec/T-REC-P.862.
36. J. Richard, V. Zimpfer, S. Roth: Comparison of objective and subjective methods for evaluating speech quality and intelligibility recorded through bone conduction and in-ear microphones, Applied Acoustics 211 (2023) 109576.
37. M. Ohlenbusch, C. Rollwage, S. Doclo: Modeling of speech-dependent own voice transfer characteristics for hearables with in-ear microphones: audio examples, Zenodo, 2024. https://doi.org/10.5281/zenodo.11371976.
38. A. Edraki, W.-Y. Chan, J. Jensen, D. Fogerty: Speaker adaptation for enhancement of bone-conducted speech, in: Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, South Korea, 14–19 April, 2024, IEEE, pp. 10456–10460.
39. L. He, H. Hou, S. Shi, X. Shuai, Z. Yan: Towards bone-conducted vibration speech enhancement on head-mounted wearables, in: Proceedings of 21st Annual International Conference on Mobile Systems, Applications and Services, Helsinki, Finland, 18–22 June, 2023, Association for Computing Machinery, pp. 14–27.
40. M. Wang, J. Chen, X.-L. Zhang, S. Rahardja: End-to-end multi-modal speech recognition on an air and bone conducted speech corpus, IEEE/ACM Transactions on Audio, Speech, and Language Processing 31 (2023) 513–524.

Cite this article as: Ohlenbusch M., Rollwage C. & Doclo S. 2024. Modeling of speech-dependent own voice transfer characteristics for hearables with an in-ear microphone. Acta Acustica, 8, 28.

