Open Access

Acta Acustica, Volume 7 (2023), Article Number 18, 6 pages
Section: Hearing, Audiology and Psychoacoustics
DOI: https://doi.org/10.1051/aacus/2023016
Published online: 24 May 2023

© The Author(s), published by EDP Sciences, 2023

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1 Introduction

Nonspeech audio and speech signal distortions appear in many communication systems, e.g., in telephones, headphones, and hearing aids. The causes of these distortions are manifold, e.g., bandwidth limitations, audio codecs, internal noise, dynamic range compression, or noise reduction artifacts [1–4]. Many psychoacoustical experiments rely on a controllable simulation of these distortions to examine human perception of distorted audio signals. Two widely used types of distortion simulations are hard peak clipping and the Modulated Noise Reference Unit (MNRU) [5]. The purpose of this study is to present and validate an alternative distortion simulation method that represents typical speech processing artifacts. We concentrate on “musical” noise distortions caused by a spectral subtraction algorithm, but in general this simulation method could be applied to any speech processing algorithm that does not alter the phase of the original signal.

Hard peak clipping is based on limiting the amplitude of a signal to a specific threshold, creating flat cutoffs. In the frequency domain, these flat cutoffs generate many high-frequency harmonics, which can be perceived as disturbing. The level of created distortions is controlled by the threshold, i.e., a lower threshold leads to more distortions (see, e.g., [6]).

MNRU is based on the summation of the original speech signal and random noise modulated by the original signal. The level of distortions is controlled by the signal-to-noise ratio (SNR) of the speech and modulated noise signal. In research, MNRU is widely used to simulate signal degradations commonly found in telecommunication scenarios (see, e.g., [7]).

Despite their common use, neither method adequately simulates the distortions typically created by noise reduction algorithms, such as the commonly used spectral subtraction type. Spectral subtraction algorithms randomly create isolated peaks in the spectrum of the output signal. These peaks are perceived similarly to short musical tones; hence the created distortions are called “musical” noise, which the above-mentioned methods fail to mimic. The amount of musical noise distortions and residual noise in the output signal of these algorithms is difficult to disentangle and, hence, to control. The aim of this study is to develop a method that allows adding musical noise to a clean speech signal in a controlled way without creating residual noise. The goal is to produce “musical” noise artifacts that are more relevant for speech processing systems than existing ways of systematically adding distortions, and to control these distortions precisely with respect to the subjective impression they elicit. The method is evaluated subjectively with participant testing using rating scales, and objectively via model predictions. We investigate the perceived listening effort on a 13-point scale ranging from “no effort” to “extreme effort” [8] and the perceived sound quality on a continuous 5-point scale ranging from “bad” to “excellent” [9]. Specifically, the following research questions are addressed:

  • Can we find parameters so that the added musical noise distortions

    • Evoke listening effort ratings covering the entire rating scale from “no effort” to “extreme effort”?

    • Lead to sound quality ratings covering the entire rating scale from “excellent” to “bad”?

  • Can we model the subjectively measured perceived listening effort?

2 Methods

2.1 Adding musical noise to a signal

Spectral subtraction noise reduction algorithms estimate speech from a noisy speech signal according to [10, 11]:

$$\hat{S}(f) = \sqrt{\max\left(\left|X(f)\right|^{2} - k\left|\hat{W}(f)\right|^{2},\ 0\right)}\, e^{i\phi_{X}(f)}, \qquad (1)$$

where f denotes frequency, Ŝ(f) is the frequency-dependent estimated speech signal spectrum, X(f) is the mixed signal spectrum (i.e., speech plus noise), Ŵ(f) is the estimated noise signal spectrum, ϕ_X(f) is the phase of the mixed signal, and the parameter k ≥ 0 controls the subtraction (k = 0 leaves the mixed signal untouched). The estimated speech time signal is obtained by transforming the spectrum back into the time domain. In practice, the noise estimate is imperfect and, hence, the estimated speech signal contains the clean speech signal, residual noise, and random musical noise distortions. The algorithm parameters can be tuned such that noise is strongly suppressed, which typically results in strong musical tones, or vice versa.
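To make the processing concrete, a minimal frame-wise implementation of equation (1) could look as follows. This Python sketch is illustrative only: the STFT parameters, the function name spectral_subtract, and the assumption that the noise power estimate is supplied per frequency bin are ours, not the authors’ exact implementation.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtract(x, noise_psd, k, fs=16000, nperseg=512):
    """Power spectral subtraction per Eq. (1); noise_psd holds |W_hat(f)|^2 per STFT bin."""
    _, _, X = stft(x, fs=fs, nperseg=nperseg)
    power = np.abs(X) ** 2 - k * noise_psd[:, None]   # subtract scaled noise power estimate
    mag = np.sqrt(np.maximum(power, 0.0))             # half-wave rectification, the max(., 0) in Eq. (1)
    S_hat = mag * np.exp(1j * np.angle(X))            # keep the noisy phase, as in Eq. (1)
    _, s_hat = istft(S_hat, fs=fs, nperseg=nperseg)
    return s_hat
```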

For the purpose of creating controlled simulations of distortions, it would be important to remove the residual noise from the distorted speech signal. To achieve this, we applied the method of Hagerman and Olofsson [12] and created two mixed signals as follows:

$$x_{+}(n) = s(n) + w(n), \qquad x_{-}(n) = s(n) - w(n), \qquad (2)$$

where x₊(n) and x₋(n) are the mixed signals, s(n) is the speech signal, and w(n) is the noise signal, all in the discrete time domain, i.e., as functions of the sample index n. We used Gaussian noise with a lower band limit of 20 Hz and an upper band limit of 8 kHz to mask speech sampled at 16 kHz. With equation (1) we calculated one estimated speech signal for each mixed signal of equation (2). By summing the two estimated speech signals, the residual noises cancel out and we obtain the clean speech signal with musical noise distortions. The amount of musical noise distortions can be manipulated by adjusting the SNR of the two mixed signals in equation (2) and by adjusting the parameter k in equation (1).
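Building on the previous sketch, the Hagerman-and-Olofsson cancellation of equation (2) could be implemented as below. The band-pass filter design, the SNR scaling, and the averaging factor of 0.5 after summation are our assumptions; the upper cutoff is placed just below the Nyquist frequency of the 16 kHz material.

```python
import numpy as np
from scipy.signal import stft, butter, sosfilt

def add_musical_noise(s, snr_db, k, fs=16000, nperseg=512):
    """Add musical noise to clean speech s via Eq. (2) plus residual-noise cancellation."""
    rng = np.random.default_rng()
    sos = butter(4, [20, 7999], btype="bandpass", fs=fs, output="sos")  # 20 Hz to ~8 kHz band
    w = sosfilt(sos, rng.standard_normal(len(s)))                       # band-limited Gaussian noise
    w *= np.sqrt(np.mean(s**2) / np.mean(w**2)) * 10 ** (-snr_db / 20)  # set the requested SNR
    _, _, W = stft(w, fs=fs, nperseg=nperseg)
    noise_psd = np.mean(np.abs(W) ** 2, axis=1)    # noise-only power estimate used in Eq. (1)
    s_plus = spectral_subtract(s + w, noise_psd, k, fs, nperseg)
    s_minus = spectral_subtract(s - w, noise_psd, k, fs, nperseg)
    return 0.5 * (s_plus + s_minus)                # residual noise cancels; musical tones remain
```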

We quantified the amount of added musical noise distortions with the Perceptual Similarity Measure (PSM) of the PEMO-Q model [4]. A PSM value of 1 indicates no perceivable difference between the reference and the test signal. The lower the PSM, the more perceivable the difference, i.e., the more musical noise distortions were added to the reference signal. As the reference speech signal we used five concatenated sentences of the Oldenburg German matrix sentence test (OLSA) procedure [13–15]. Each sentence had the same syntactic structure: name, verb, numeral, adjective, and object. For every word type, ten well-known German words were available, which could be randomly combined to produce syntactically correct but semantically unpredictable sentences.

To determine the impact of different combinations of SNR and k values on the PSM values, we added musical noise distortions with SNR values in the interval [−100 dB, 0 dB] in steps of 2 dB and k values in the interval [0, 5] in steps of 0.1, because k = 5 is suggested as the maximum value in the literature [16]. To account for the randomness with which the musical noise distortions were added to the signal, we repeated the PSM calculation ten times for every combination of SNR and k. With the help of curve fits, we then calculated the ten combinations of SNR and k values to be used in the evaluations (see Section 3.1).
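A hedged sketch of this grid evaluation is given below; psm() is a placeholder for the PEMO-Q PSM computation (which is not reimplemented here), ref stands for the concatenated OLSA reference signal, and add_musical_noise refers to the sketch above.

```python
import numpy as np

snrs = np.arange(-100, 2, 2)                 # SNR grid: -100 dB to 0 dB in 2 dB steps
ks = np.round(np.arange(0.0, 5.1, 0.1), 1)   # k grid: 0 to 5 in steps of 0.1
n_rep = 10                                   # repetitions to average out the noise randomness

psm_mean = np.empty((len(ks), len(snrs)))
for i, k in enumerate(ks):
    for j, snr in enumerate(snrs):
        vals = [psm(ref, add_musical_noise(ref, snr, k)) for _ in range(n_rep)]
        psm_mean[i, j] = np.mean(vals)       # feeds the sigmoid fits of Section 3.1
```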

2.2 Evaluation

For the subjective evaluation, we invited participants to assess the perceived listening effort and perceived sound quality for different amounts of added musical noise distortions. Prior to these listening tasks, the individual presentation level corresponding to medium loud was determined by adaptive categorical loudness scaling [17, 18] using the software framework of Ewert [19]. All subsequent listening task stimuli were presented at this level.

2.2.1 Experimental setup

Participants sat in a small soundproof measurement chamber wearing open headphones (Sennheiser HD 650). The stimuli were presented diotically. The signal processing was done on a PC outside of the chamber, while an RME Fireface UC sound card was used as an interface between the PC and the headphones. A 22” touch screen (iiyama ProLite T2252MSC) was placed in front of the participants.

2.2.2 Participants

Twenty-four native German-speaking, younger, normal-hearing persons participated in this study (13 females, 11 males). Their ages ranged from 21 to 36 years. The participants’ better ear pure-tone average (PTA4, calculated as the average hearing threshold across the four frequencies 500 Hz, 1 kHz, 2 kHz, and 4 kHz) covered a range from -6.2 dB HL to 7.5 dB HL. Participants were recruited on a voluntary basis and were paid a small allowance for their participation. Every participant was informed about the study and signed consent forms. Ethical approval was provided by the ethics committee of the University of Oldenburg (Drs.EK/2019/073-02).

2.2.3 Perceived listening effort

Perceived listening effort was measured using categorical listening effort ratings of 40 sound signals. The sound signals consisted of 40 different OLSA sentences: ten different PSM values each repeated four times. The sound signals were randomized and rated according to perceived listening effort on a 13-point scale as mentioned in the introduction, where the category “no effort” corresponded to a listening effort of 1 Effort Scale Categorical Unit (ESCU) and the category “extreme effort” corresponded to a listening effort of 13 ESCU. An additional category “no speech” was included to catch the stimuli in which participants could not identify any speech and, hence, could not reasonably assess the effort associated with recognizing the target speech [8]. Curves (listening effort as function of the PSM value) were fitted to the individual data points following the fitting method described by Krueger et al. [8].

With a view to designing an adaptive listening effort scaling method in a future study, we determined the PSM value difference (∆PSM) that leads, on average, to a listening effort difference of 1 ESCU. Similar to [8], this ∆PSM value could be used as the step size in a first phase to explore the available PSM space.

2.2.4 Perceived sound quality

Perceived sound quality was measured using continuous sound quality ratings of the same 40 sound signals as used for the perceived listening effort ratings. The sound signals were randomized and rated according to the perceived quality of the speech on a 5-point scale as mentioned in the introduction. The rating “excellent” corresponded to a Mean Opinion Score (MOS) of 5, while the rating “bad” corresponded to a MOS of 1 [9]. The participants selected a rating by continuously moving a slider between these two endpoint ratings.
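The paper does not specify how slider positions were mapped to MOS values; assuming a linear mapping from a normalized slider position, a minimal sketch would be:

```python
def slider_to_mos(position: float) -> float:
    """Map a slider position in [0, 1] ("bad" to "excellent") to a MOS in [1, 5]; assumed linear."""
    return 1.0 + 4.0 * position
```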

As already shown by Huber et al. [4], the relationship between the PSM and subjectively perceived sound quality ratings is not necessarily linear but may exhibit floor and ceiling effects (i.e., data points at the minimum and maximum rating scale value, respectively). For this reason, the individual data points were divided into four equally spaced segments in the PSM domain. A linear regression curve (sound quality as a function of the PSM value) was fitted to the data points in every segment such that one connecting line over all segments was created. This fitting method made it possible to account for possible floor and ceiling effects (i.e., horizontal lines in the first and last segment, respectively).
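One way to realize such a continuous piecewise-linear fit is a hinge (truncated-line) basis solved by least squares, sketched below; the basis construction is our choice and not necessarily the authors’ exact implementation.

```python
import numpy as np

def piecewise_fit(psm_values, mos_values, psm_lo, psm_hi, n_seg=4):
    """Continuous piecewise-linear least-squares fit with equally spaced knots."""
    knots = np.linspace(psm_lo, psm_hi, n_seg + 1)[1:-1]   # three interior knots -> four segments
    # Hinge terms max(0, PSM - knot) keep the fitted line connected across segment borders
    X = np.column_stack([np.ones_like(psm_values), psm_values] +
                        [np.maximum(0.0, psm_values - kn) for kn in knots])
    coef, *_ = np.linalg.lstsq(X, mos_values, rcond=None)
    return knots, coef
```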

2.2.5 Objective evaluation

In addition to the subjective evaluations, we used the non-intrusive LEAP model [20] to objectively evaluate the listening effort caused by the musical noise distortions. This model is based on uncertainty measures of phoneme classifications in an automatic speech recognition engine. It has previously been employed to predict the detrimental effects of various types of broadcast background sounds [21] and of combined and isolated noise and reverberation [22], as well as the beneficial effects of non-linear speech enhancement on listening effort [23]. The model was employed here without any adaptation.

3 Results

3.1 Pre-evaluation considerations

The PSM values calculated according to Section 2.1 ranged from 0.45 to 1.00, with a maximum standard deviation across the ten repetitions of 0.03. For each k value, we fitted sigmoid curves to the calculated PSM values as a function of the SNR as follows:

$$\mathrm{PSM}(\mathrm{SNR}) = C_{1} + \frac{C_{2} - C_{1}}{1 + e^{\left(C_{3} - \mathrm{SNR}\right)/C_{4}}}, \qquad (3)$$

where C1 represents the minimum value, C2 represents the maximum value, C3 represents the shift of the inflection point from the SNR = 0 dB coordinate, and C4 represents the shape parameter of the sigmoid curve. The calculated PSM values and the sigmoid curve fits are illustrated for integer k values in Figure 1.
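A hedged sketch of this fit with scipy is given below; snrs and psm_of_snr are assumed to hold the SNR grid and the mean PSM values for one k value from Section 2.1, and the starting values for C1–C4 are our guesses.

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(snr, c1, c2, c3, c4):
    # Eq. (3): PSM(SNR) = C1 + (C2 - C1) / (1 + exp((C3 - SNR) / C4))
    return c1 + (c2 - c1) / (1.0 + np.exp((c3 - snr) / c4))

p0 = [0.45, 1.0, -50.0, 10.0]                       # starting values for C1..C4 (assumed)
(c1, c2, c3, c4), _ = curve_fit(sigmoid, snrs, psm_of_snr, p0=p0)
```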

Figure 1: Calculated PSM values (open symbols) and sigmoid curve fits (solid lines) as a function of the SNR for integer k values (different colors). The red bars separate the plateau, i.e., the lowest 5% of the PSM value range, from the upper 95% of the PSM value range, for each value k > 0.

To calculate the SNR and k values to be used in the evaluation, we utilized the simplex search method of Lagarias et al. [24] to find combinations of SNR and k values leading to ten linearly spaced PSM values ranging from the minimum to the maximum PSM value according to equation (3). In addition, we added the constraint that with increasing PSM value the SNR value must increase and the k value must decrease. Since the PSM values decreased very slowly for very low SNR values (the sigmoid curves approach their asymptote C1 for SNR ≪ C3, see Fig. 1), we excluded PSM values in the lowest 5% of the PSM value range for each k value in the search algorithm (i.e., we required PSM > C1 + 0.05·(C2 − C1)). The corresponding separation points are illustrated by red bars in Figure 1. In this way we avoided jumps in the SNR value when searching for linearly spaced PSM values. The resulting SNR, k, and PSM values are summarized in Table 1. Stimuli created with these ten parameter sets were then used in the further evaluations.
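The search could be sketched as below using scipy’s Nelder–Mead implementation; fitted_psm, psm_min, and psm_max are placeholders for the fitted PSM surface of equation (3) and its range, and the monotonicity constraint and plateau exclusion described above are omitted for brevity.

```python
import numpy as np
from scipy.optimize import minimize

targets = np.linspace(psm_min, psm_max, 10)     # ten linearly spaced PSM targets

def objective(params, target):
    snr, k = params
    return (fitted_psm(snr, k) - target) ** 2   # fitted_psm: Eq. (3) evaluated over (SNR, k)

pairs = [minimize(objective, x0=[-50.0, 2.5], args=(t,), method="Nelder-Mead").x
         for t in targets]                      # one (SNR, k) pair per PSM target
```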

Table 1: SNR and k values resulting in ten linearly spaced PSM values.

3.2 Relationship of listening effort and musical noise distortions

The curves fitted to the individual listening effort data, the mean curve, and the curve fitted to the corresponding data predicted by the LEAP model are illustrated in Figure 2.

Figure 2: Listening effort as a function of the PSM. Curve fits to the individual rating data for each participant (light gray lines), the mean curve (black line), and the LEAP model prediction (dashed black line) are shown.

As expected, perceived listening effort decreased with increasing PSM, i.e., with decreasing distortions. The PSM values of the individual curves ranged from 0.57 for 13 ESCU to 0.99 for 1 ESCU, covering the whole available listening effort scale. The mean rating behavior of the participants was well predicted by the LEAP model; the maximum difference between the mean and the LEAP curve was ∆PSM = 0.12 at 6 ESCU. Overall, we found a strong and significant correlation between the individually fitted PSM values and the model-predicted PSM values (r = 0.95, p < 0.01). On average, a difference in PSM value of 0.03 led to a difference in perceived listening effort of 1 ESCU. We therefore define the step size ∆PSM as 0.03.
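For reference, the reported correlation corresponds to a standard Pearson test; the variable names in the sketch below are placeholders for the individually fitted and the LEAP-predicted PSM values.

```python
from scipy.stats import pearsonr

# psm_individual: PSM values from the individual listening effort curve fits
# psm_leap: the corresponding LEAP-predicted PSM values
r, p = pearsonr(psm_individual, psm_leap)
```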

3.3 Relationship of sound quality and musical noise distortions

The curves fitted to the individual sound quality data and the mean curve are illustrated in Figure 3.

Figure 3: MOS as a function of the PSM. Curve fits to the individual rating data for each participant (light gray lines) and the mean curve (black line) are shown. The four fitting segments are illustrated by dashed black lines.

The data revealed a strong floor effect: in the first segment (PSM < 0.64) the mean curve had the smallest slope of all segments (slope = 1.4). After the first segment, the mean MOS continued to increase over segments two, three, and four with slopes of 6.7, 14.9, and 9.4, respectively. Of all participants, participant number 2 showed the strongest floor effect: their MOS increased only after the second segment (PSM > 0.76). In addition to the floor effect, participant number 1 showed a strong ceiling effect: their MOS equaled 5 in the fourth segment (PSM > 0.88).

4 Discussion and conclusions

The results of our study indicate that, with musical noise distortions added as specified in Section 2, the ten identified PSM values cover the whole available listening effort scale from “extreme effort” to “no effort” needed to follow the speech. The proposed method is therefore suitable for systematically introducing musical noise distortions with perceptual consequences covering the entire listening effort range. The data indicated that a difference of 1 ESCU corresponded to a PSM change of 0.03. We therefore propose this as a minimum step size in future studies employing this procedure. It is important to note that this step size refers to the specific conditions used in this evaluation study and may differ for other conditions.

The participants’ listening effort ratings closely followed the LEAP model predictions, indicating that LEAP is applicable to quantitatively predicting the perceptual impact of this type of distortion. This is notable because the model was not adapted in any way, which suggests that it may be useful for predicting the perceptual impact of a wide range of speech disturbances [21–23].

With respect to sound quality ratings, the ten identified PSM values cover the whole available sound quality scale from “bad” to “excellent”. In other words, the proposed method is also suitable for producing all quality categories that may be required in, e.g., system evaluation studies. The data show a strong floor effect, i.e., sound quality ratings are on average insensitive to PSM values between 0.52 and 0.64. This floor effect was already observed for audio codec distortions in [4]. We therefore suggest considering only PSM values greater than 0.64 for sound quality ratings. However, two individual data sets deviated strongly from the mean, i.e., even within the suggested PSM value range, floor and/or ceiling effects could be observed. It is also worth noting that four participants rated the undistorted sentences (i.e., PSM = 1) with a MOS smaller than the mean value (<4.9). A possible reason could be that the signals had a limited bandwidth (sampling frequency of 16 kHz), which the normal-hearing participants no longer rated as “excellent”.

With its necessary simplifications, the study has inherent limitations. A specific set of conditions was used in the evaluation: distorted OLSA sentences presented at a medium-loud level. To what extent the results can be generalized to different speech material (e.g., effortful speech) or different presentation levels (e.g., loud levels) is unclear. Only young normal-hearing participants took part in the experiment; the results are therefore not directly applicable to, e.g., hearing-impaired persons. To what extent a hearing loss and/or amplification influences the ratings of musical noise distortions should be the subject of future research.

Acknowledgments

This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy – EXC 2177/1 – Project ID 390895286. The authors thank Thomas Brand for very valuable suggestions regarding the data analysis and Julia Thomas for conducting the participant tests.

Data Availability Statement

The code of the proposed method and some example audio files are available as supplementary material at https://github.com/goesjon/addMusicalNoise [25].

References

  1. I. Brons, W.A. Dreschler, R. Houben: Detection threshold for sound distortion resulting from noise reduction in normal-hearing and hearing-impaired listeners. Journal of the Acoustical Society of America 136, 3 (2014) 1375–1384.
  2. P. Kendrick, I.R. Jackson, F.F. Li, T.J. Cox, B.M. Fazenda: Perceived audio quality of sounds degraded by non-linear distortions and single-ended assessment using HASQI. Journal of the Audio Engineering Society 63, 9 (2015) 698–712.
  3. J. Agnew: The causes and effects of distortion and internal noise in hearing aids. Trends in Amplification 3, 3 (1998) 82–118.
  4. R. Huber, B. Kollmeier: PEMO-Q – a new method for objective audio quality assessment using a model of auditory perception. IEEE Transactions on Audio, Speech, and Language Processing 14, 6 (2006) 1902–1911.
  5. ITU-T P.810: Modulated Noise Reference Unit (MNRU). International Telecommunication Union, Feb. 1996.
  6. A.M. Kubiak, J. Rennies, S.D. Ewert, B. Kollmeier: Relation between hearing abilities and preferred playback settings for speech perception in complex listening conditions. International Journal of Audiology (2021) 1–10.
  7. A. Takahashi, H. Yoshino, N. Kitawaki: Perceptual QoS assessment technologies for VoIP. IEEE Communications Magazine 42, 7 (2004) 28–34.
  8. M. Krueger, M. Schulte, T. Brand, I. Holube: Development of an adaptive scaling method for subjective listening effort. Journal of the Acoustical Society of America 141, 6 (2017) 4680–4693.
  9. ITU-T P.800: Methods for subjective determination of transmission quality. International Telecommunication Union, Aug. 1996.
  10. J. Thiemann: Acoustic noise suppression for speech signals using auditory masking effects. Master's thesis, Department of Electrical & Computer Engineering, McGill University, Montreal, Canada, 2001. Accessed: Sep. 29, 2022. [Online]. Available: https://escholarship.mcgill.ca/concern/theses/vm40xt64p.
  11. S. Boll: Suppression of acoustic noise in speech using spectral subtraction. IEEE Transactions on Acoustics, Speech, and Signal Processing 27, 2 (1979) 113–120.
  12. B. Hagerman, A. Olofsson: A method to measure the effect of noise reduction algorithms using simultaneous speech and noise. Acta Acustica united with Acustica 90, 2 (2004) 356–361.
  13. K. Wagener, T. Brand, B. Kollmeier: Development and evaluation of a German sentence test Part II: Optimization of the Oldenburg sentence test. Zeitschrift für Audiologie 38, 2 (1999) 44–56.
  14. K. Wagener, T. Brand, B. Kollmeier: Development and evaluation of a German sentence test Part III: Evaluation of the Oldenburg sentence test. Zeitschrift für Audiologie 38, 3 (1999) 86–95.
  15. K. Wagener, V. Kühnel, B. Kollmeier: Development and evaluation of a German sentence test. Part I: Design of the Oldenburg sentence test. Zeitschrift für Audiologie 38, 1 (1999) 4–15.
  16. M. Berouti, R. Schwartz, J. Makhoul: Enhancement of speech corrupted by acoustic noise, in IEEE International Conference on Acoustics, Speech, and Signal Processing, 1979, 208–211. https://doi.org/10.1109/ICASSP.1979.1170788.
  17. T. Brand, V. Hohmann: An adaptive procedure for categorical loudness scaling. Journal of the Acoustical Society of America 112, 4 (2002) 1597–1604.
  18. D. Oetting, T. Brand, S.D. Ewert: Optimized loudness-function estimation for categorical loudness scaling data. Hearing Research 316 (2014) 16–27.
  19. S.D. Ewert: AFC – A modular framework for running psychoacoustic experiments and computational perception models, in Proceedings of the International Conference on Acoustics AIA-DAGA, 18–21 March 2013, Merano, Italy, 2013, 1326–1329.
  20. R. Huber, M. Krüger, B.T. Meyer: Single-ended prediction of listening effort using deep neural networks. Hearing Research 359 (2018) 40–49.
  21. R. Huber, H. Baumgartner, V.N. Krishnan, S. Goetze, J. Rennies-Hochmuth: Single-ended prediction of listening effort for English speech, in DAGA 2020 – 46th Annual Meeting for Acoustics, 16–19 March 2020, Hannover, Germany, 2020, 774–777.
  22. J. Rennies, S. Röttges, R. Huber, C.F. Hauth, T. Brand: A joint framework for blind prediction of binaural speech intelligibility and perceived listening effort. Hearing Research 426 (2022) 108598.
  23. R. Huber, A. Pusch, N. Moritz, J. Rennies, H. Schepker, B.T. Meyer: Objective assessment of a speech enhancement scheme with an automatic speech recognition-based system, in Speech Communication; 13th ITG-Symposium, 10–12 October 2018, Oldenburg, Germany, 2018, 1–5.
  24. J.C. Lagarias, J.A. Reeds, M.H. Wright, P.E. Wright: Convergence properties of the Nelder–Mead simplex method in low dimensions. SIAM Journal on Optimization 9, 1 (1998) 112–147.
  25. J.A. Gößwein: addMusicalNoise. 2023. Accessed: Jan. 17, 2023. [Online]. Available: https://github.com/goesjon/addMusicalNoise.

Cite this article as: Gößwein JA, Kollmeier B & Rennies J. 2023. Method to control the amount of “musical” noise for speech quality assessments. Acta Acustica, 7, 18.

