Method to control the amount of "musical" noise for speech quality assessments

This study presents a method of adding to clean speech signals a controlled degree of "musical" noise distortions that mimic typical artefacts of speech enhancement systems. The resulting distorted speech signals were evaluated with respect to listening effort and sound quality in subjective listening tests and via model predictions. Both subjective ratings and model predictions consistently covered the entire rating scales from "excellent"/"no effort" to "bad"/"extreme effort". The proposed method proved to be useful for systematic assessments of "musical" noise distortions for the conditions tested in this study.


Introduction
Audio and speech signal distortions appear in many communication systems, e.g., in telephones, headphones, and hearing aids. The causes of these distortions are manifold, e.g., bandwidth limitations, audio codecs, internal noise, dynamic range compression, or noise reduction artifacts [1][2][3][4]. Many psychoacoustical experiments rely on a controllable simulation of these distortions to examine human perception of distorted audio signals. Two widely used types of distortion simulations are hard peak clipping and the Modulated Noise Reference Unit (MNRU) [5]. The purpose of this study is to present and validate an alternative distortion simulation method that represents typical speech processing artifacts. We concentrate on "musical" noise distortions caused by a spectral subtraction algorithm, but in general this simulation method could be applied to any speech processing algorithm that does not alter the phase of the original signal.
Hard peak clipping is based on limiting the amplitude of a signal to a specific threshold, creating flat cutoffs. In the frequency domain, these flat cutoffs generate many high-frequency harmonics, which can be perceived as disturbing. The level of created distortions is controlled by the threshold, i.e., a lower threshold leads to more distortions (see, e.g., [6]).
MNRU is based on the summation of the original speech signal and random noise modulated by the original signal.
The level of distortions is controlled by the signal-to-noise ratio (SNR) of the speech and modulated noise signal. In research, MNRU is widely used to simulate signal degradations commonly found in telecommunication scenarios (see, e.g., [7]).
Despite their common use, neither method adequately simulates the distortions typically created by noise reduction algorithms, such as the commonly used spectral subtraction type. Spectral subtraction algorithms randomly create isolated peaks in the spectrum of the output signal. These peaks are perceived similarly to short musical tones; hence the created distortions are called "musical" noise, which the above-mentioned methods fail to mimic. The amounts of musical noise distortions and residual noise in the output signal of these algorithms are difficult to disentangle and, hence, to control. The aim of this study is to develop a method that allows adding musical noise to a clean speech signal in a controlled way without creating residual noise. The goal is to produce "musical" noise artifacts that are more relevant for speech processing systems than existing ways of systematically adding distortions, and to control these distortions precisely with respect to the subjective impression they elicit. The method is evaluated subjectively in listening tests using rating scales, and objectively via model predictions. We investigate the perceived listening effort on a 13-point scale ranging from "no effort" to "extreme effort" [8] and the perceived sound quality on a continuous 5-point scale ranging from "bad" to "excellent" [9]. Specifically, the following research questions are addressed. Can we find parameters so that the added musical noise distortions
- evoke listening effort ratings covering the entire rating scale from "no effort" to "extreme effort"?
- lead to sound quality ratings covering the entire rating scale from "excellent" to "bad"?
Furthermore, can we model the subjectively measured perceived listening effort?

Adding musical noise to a signal
Spectral subtraction noise reduction algorithms estimate speech from a noisy speech signal according to [10, 11]:

Ŝ(f) = (|X(f)| − k · |Ŵ(f)|) · e^(jφ_X(f)),    (1)

where f denotes frequency, Ŝ(f) is the frequency-dependent estimated speech signal spectrum, X(f) is the mixed signal spectrum (i.e., speech plus noise), Ŵ(f) is the estimated noise signal spectrum, φ_X(f) is the phase of the mixed signal, and the parameter k ≥ 0 controls the subtraction (k = 0 leaves the mixed signal untouched). The estimated speech time signal can be obtained by transforming the spectrum into the time domain. In practice, the noise estimate is imperfect and, hence, the estimated speech signal contains the clean speech signal, residual noise, and random musical noise distortions. The algorithm parameters can be tuned such that noise is strongly suppressed, which typically results in strong musical tones, or vice versa. For the purpose of creating controlled simulations of distortions, it would be important to remove the residual noise from the distorted speech signal. To achieve this, we applied the method of Hagerman and Olofsson [12] and created two mixed signals as follows:

x+(n) = s(n) + w(n),
x−(n) = s(n) − w(n),    (2)

where x+(n) and x−(n) are the mixed signals, s(n) is the speech signal, and w(n) is the noise signal, all in the discrete time domain, i.e., as functions of the sample index n. We used Gaussian noise with a lower band limit of 20 Hz and an upper band limit of 8 kHz to mask speech sampled at 16 kHz. With equation (1) we calculated one estimated speech signal for each mixed signal of equation (2). By summing the two estimated speech signals, the residual noises cancel out and we obtain the clean speech signal with musical noise distortions. The amount of musical noise distortions can be manipulated by adjusting the SNR of the two mixed signals in equation (2) and by adjusting the parameter k in equation (1). We quantified the amount of added musical noise distortions with the Perceptual Similarity Measure (PSM) of the PEMO-Q model [4].
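The two-mixture cancellation can be sketched as follows. This is a minimal illustration, not the exact processing of the study: it assumes a basic magnitude-domain spectral subtraction with half-wave rectification, a fixed single-frame noise-magnitude estimate, and a sinusoid as a stand-in for speech; the window length, overlap, and noise estimator are illustrative choices.

```python
import numpy as np

def spectral_subtract(x, noise_mag, k, n_fft=512, hop=256):
    """Naive magnitude spectral subtraction; keeps the noisy phase.

    Overlap-add with a Hann window; the output is normalized by the
    accumulated squared window so that k = 0 reconstructs the input.
    """
    win = np.hanning(n_fft)
    out = np.zeros(len(x))
    norm = np.zeros(len(x))
    for start in range(0, len(x) - n_fft + 1, hop):
        frame = x[start:start + n_fft] * win
        X = np.fft.rfft(frame)
        # Subtract the scaled noise magnitude; half-wave rectify (common safeguard)
        mag = np.maximum(np.abs(X) - k * noise_mag, 0.0)
        S = mag * np.exp(1j * np.angle(X))  # reuse the mixed signal's phase
        out[start:start + n_fft] += np.fft.irfft(S) * win
        norm[start:start + n_fft] += win ** 2
    return out / np.maximum(norm, 1e-12)

rng = np.random.default_rng(0)
fs = 16000
s = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)  # stand-in "speech" signal
w = rng.standard_normal(fs)                        # masking noise
snr_db = -10                                       # SNR of the two mixtures
w *= np.sqrt(np.mean(s**2) / np.mean(w**2)) * 10 ** (-snr_db / 20)

# The two mixtures with opposite noise sign
x_plus, x_minus = s + w, s - w

# Crude fixed noise-magnitude estimate shared by both mixtures
noise_mag = np.abs(np.fft.rfft(w[:512] * np.hanning(512)))

# Average the two processed mixtures: residual noise largely cancels,
# leaving the speech plus the musical-noise distortions.
y = 0.5 * (spectral_subtract(x_plus, noise_mag, k=2.0)
           + spectral_subtract(x_minus, noise_mag, k=2.0))
```

Because both mixtures pass through processing with the same noise estimate, the residual noise components enter with opposite signs and largely cancel in the average; the difference of the two processed signals would conversely isolate the residual noise, as in Hagerman and Olofsson's original decomposition.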
A PSM value of 1 indicates no perceivable difference between reference and test signal. The lower the PSM, the more perceivable the difference, i.e., the more musical noise distortions were added to the reference signal. As reference speech signal we used five concatenated sentences of the Oldenburg German matrix sentence test (OLSA) procedure [13][14][15]. Each sentence had the same syntactic structure: name, verb, numeral, adjective, and object. For every word type, ten well-known German words were available which could be randomly combined to produce syntactically correct but semantically unpredictable sentences.
To determine the impact of different combinations of SNR and k values on the PSM values, we added musical noise distortions with SNR values in the interval of [−100 dB, 0 dB] in steps of 2 dB and k values in the interval of [0, 5] in steps of 0.1, because k = 5 is suggested as the maximum value in the literature [16]. To account for the randomness with which the musical noise distortions were added to the signal, we repeated the PSM calculation ten times for every SNR and k value. With the help of curve fits, we calculated ten SNR and k values to be used in the evaluations.

Evaluation
For the subjective evaluation, we invited participants to assess the perceived listening effort and perceived sound quality for different amounts of added musical noise distortions. Prior to these listening tasks, the individual presentation level corresponding to medium loud was determined by adaptive categorical loudness scaling [17,18] using the software framework of Ewert [19]. All subsequent listening task stimuli were presented at this level.

Experimental setup
Participants sat in a small soundproof measurement chamber wearing open headphones (Sennheiser HD 650). The stimuli were presented diotically. The signal processing was done on a PC outside of the chamber while an RME Fireface UC sound card was used as an interface between the PC and the headphones. A 22" touch screen (iiyama ProLite T2252MSC) was placed in front of the participants.

Participants
Twenty-four native German-speaking, young, normal-hearing persons participated in this study (13 females, 11 males). Their ages ranged from 21 to 36 years. The participants' better ear pure-tone average (PTA4, calculated as the average hearing threshold across the four frequencies 500 Hz, 1 kHz, 2 kHz, and 4 kHz) covered a range from −6.2 dB HL to 7.5 dB HL. Participants were recruited on a voluntary basis and were paid a small allowance for their participation. Every participant was informed about the study and signed consent forms. Ethical approval was provided by the ethics committee of the University of Oldenburg (Drs.EK/2019/073-02).

Perceived listening effort
Perceived listening effort was measured using categorical listening effort ratings of 40 sound signals. The sound signals consisted of 40 different OLSA sentences: ten different PSM values each repeated four times. The sound signals were randomized and rated according to perceived listening effort on a 13-point scale as mentioned in the introduction, where the category "no effort" corresponded to a listening effort of 1 Effort Scale Categorical Unit (ESCU) and the category "extreme effort" corresponded to a listening effort of 13 ESCU. An additional category "no speech" was included to catch the stimuli in which participants could not identify any speech and, hence, could not reasonably assess the effort associated with recognizing the target speech [8]. Curves (listening effort as function of the PSM value) were fitted to the individual data points following the fitting method described by Krueger et al. [8].
In prospect of designing an adaptive listening effort scaling method in a future study, we determined the PSM value difference (ΔPSM) leading on average to a listening effort difference of 1 ESCU. Similar to [8], this ΔPSM value could be used as the step size in a first phase to explore the available PSM space.

Perceived sound quality
Perceived sound quality was measured using continuous sound quality ratings of the same 40 sound signals as used for the perceived listening effort ratings. The sound signals were randomized and rated according to perceived quality of the speech on a 5-point scale as mentioned in the introduction. The rating "excellent" corresponded to a Mean Opinion Score (MOS) of 5 while the rating "bad" corresponded to a MOS of 1 [9]. The participants selected a rating by continuously moving a slider between these two border ratings.
As already shown by Huber et al. [4], the relationship between the PSM and subjectively perceived sound quality ratings is not necessarily linear but may exhibit floor and ceiling effects (i.e., data points at the minimum and maximum rating scale values, respectively). For this reason, the individual data points were divided into four equally spaced segments in the PSM domain. A linear regression curve (sound quality as a function of the PSM value) was fitted to the data points in every segment such that one connected line across all segments was created. This fitting method made it possible to account for possible floor and ceiling effects (i.e., horizontal lines in the first and last segments, respectively).
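One way to realize such a connected segment-wise fit is least squares on a linear-spline (hinge) basis with equally spaced interior breakpoints, which enforces continuity between segments by construction. The sketch below is an illustrative implementation under that assumption, using synthetic ratings with an artificial floor effect, not the exact fitting procedure of the study.

```python
import numpy as np

def piecewise_linear_fit(psm, mos, lo, hi, n_seg=4):
    """Connected piecewise-linear fit over n_seg equally spaced segments.

    Uses a linear-spline (hinge) basis: 1, x, max(x - knot_i, 0), so the
    fitted curve is continuous with a possibly different slope per segment.
    Returns a prediction function.
    """
    knots = np.linspace(lo, hi, n_seg + 1)[1:-1]  # interior breakpoints
    A = np.column_stack([np.ones_like(psm), psm] +
                        [np.maximum(psm - kn, 0.0) for kn in knots])
    coef, *_ = np.linalg.lstsq(A, mos, rcond=None)

    def predict(x):
        x = np.atleast_1d(np.asarray(x, dtype=float))
        B = np.column_stack([np.ones_like(x), x] +
                            [np.maximum(x - kn, 0.0) for kn in knots])
        return B @ coef

    return predict

# Synthetic ratings with an artificial floor effect below PSM = 0.64
rng = np.random.default_rng(2)
psm = rng.uniform(0.52, 1.0, 200)
mos = 1.0 + 10.0 * np.maximum(psm - 0.64, 0.0) + rng.normal(0, 0.1, psm.size)

fit = piecewise_linear_fit(psm, mos, 0.52, 1.0)
```

With data like these, the fitted first segment comes out nearly horizontal (the floor effect), while the later segments pick up the rising quality ratings.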

Objective evaluation
In addition to the subjective evaluations, we used the non-intrusive LEAP model [20] to objectively evaluate the listening effort caused by the musical noise distortions. This model is based on uncertainty measures of phoneme classifications in an automatic speech recognition engine. It has been previously employed to predict the detrimental effects of various types of broadcast background sounds [21] and of combined and isolated noise and reverberation [22], as well as the beneficial effects of non-linear speech enhancement on listening effort [23]. The model was employed here without any adaptation.

Pre-evaluation considerations
The PSM values calculated according to Section 2.1 ranged from 0.45 to 1.00 with a maximum standard deviation across the ten repetitions of 0.03. We fitted sigmoid curves to the calculated PSM values for each k value as a function of the SNR:

PSM(SNR) = C1 + (C2 − C1) / (1 + e^(−(SNR − C3)/C4)),    (3)

where C1 represents the minimum value, C2 represents the maximum value, C3 represents the shift of the inflection point from the SNR = 0 dB coordinate, and C4 represents the shape parameter of the sigmoid curve. The calculated PSM values and the sigmoid curve fits are illustrated for integer k values in Figure 1.
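Such a fit can be sketched with SciPy's `curve_fit`, assuming the standard logistic form for the sigmoid with the parameter roles described above. The data here are synthetic stand-ins for the repetition-averaged PSM values of one k value; the true parameter values and the noise level are illustrative.

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(snr, c1, c2, c3, c4):
    """Logistic curve: c1 lower asymptote, c2 upper asymptote,
    c3 inflection-point shift from SNR = 0 dB, c4 shape parameter."""
    return c1 + (c2 - c1) / (1.0 + np.exp(-(snr - c3) / c4))

# Synthetic stand-in for the PSM values of one k value on the SNR grid
snr = np.arange(-100, 1, 2, dtype=float)
rng = np.random.default_rng(1)
psm = np.clip(sigmoid(snr, 0.45, 1.0, -40.0, 8.0)
              + rng.normal(0, 0.01, snr.size), 0.0, 1.0)

# Least-squares fit of the four sigmoid parameters
popt, _ = curve_fit(sigmoid, snr, psm, p0=[0.4, 1.0, -40.0, 5.0])
c1, c2, c3, c4 = popt
```

A reasonable starting guess (`p0`) matters here: with asymptotes near the observed minimum and maximum PSM and the inflection point roughly centered, the fit converges reliably.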
To calculate the SNR and k values to be used in the evaluation, we utilized the simplex search method of Lagarias et al. [24] to find combinations of SNR and k values leading to ten linearly spaced PSM values ranging from the minimum to the maximum PSM value according to formula (3). In addition, we added the constraint that with increasing PSM value the SNR value must increase and the k value must decrease. Since the PSM values decreased very slowly for very low SNR values (the sigmoid curves approach their asymptote C1 for low SNR, see Fig. 1), we excluded PSM values in the lowest 5% of the PSM value range for each k value in the search algorithm (PSM > C1 + 0.05 · (C2 − C1)). The corresponding separation points are illustrated by red bars in Figure 1. In this way we avoided jumps in the SNR value when searching for linearly spaced PSM values. The resulting SNR, k, and PSM values are summarized in Table 1. Stimuli created with these ten parameter sets were then used in the further evaluations.
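The search step can be sketched with SciPy's Nelder-Mead solver, which implements the simplex method of Lagarias et al. The PSM(SNR, k) surface below is a hypothetical smooth stand-in for the fitted sigmoids, and the study's monotonicity constraint is only hinted at via warm starting, not enforced.

```python
import numpy as np
from scipy.optimize import minimize

def psm_model(snr, k):
    """Hypothetical stand-in for the fitted PSM(SNR, k) surface:
    a logistic in SNR whose inflection point shifts with k (illustrative)."""
    c3 = -60.0 + 8.0 * k
    return 0.45 + 0.55 / (1.0 + np.exp(-(snr - c3) / 8.0))

# Ten linearly spaced target PSM values (inside the reachable range)
targets = np.linspace(0.50, 0.95, 10)

solutions = []
x0 = np.array([-40.0, 2.5])  # initial (SNR, k) guess in the active region
for t in targets:
    # Minimize the squared deviation from the target PSM value
    res = minimize(lambda p: (psm_model(p[0], p[1]) - t) ** 2, x0,
                   method='Nelder-Mead',
                   options={'xatol': 1e-6, 'fatol': 1e-12})
    solutions.append(res.x)
    x0 = res.x  # warm start the next target from the previous solution
```

Because many (SNR, k) pairs yield the same PSM value, the real procedure needs the additional constraint described above to pick a unique, monotone sequence of parameter sets.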

Relationship of listening effort and musical noise distortions
The curves fitted to the individual listening effort data, the mean curve, and the curve fitted to the corresponding data predicted by the LEAP model are illustrated in Figure 2.
As expected, perceived listening effort decreased with increasing PSM, i.e., decreasing distortions. The PSM values of the individual curves ranged from 0.57 for 13 ESCU to 0.99 for 1 ESCU, covering the whole available listening effort scale. The mean rating behavior of the participants was well predicted by the LEAP model. The maximum difference between the mean and the LEAP curve was ΔPSM = 0.12 at 6 ESCU. Overall, we found a strong and significant correlation between the individually fitted PSM values and the model-predicted PSM values (r = 0.95, p < 0.01). On average, a difference in PSM value of 0.03 led to a difference in perceived listening effort of 1 ESCU. We therefore define the step size ΔPSM as 0.03.

Relationship of sound quality and musical noise distortions
The curves fitted to the individual sound quality data and the mean curve are illustrated in Figure 3.
The data revealed a strong floor effect: in the first segment (PSM < 0.64) the mean curve had the smallest slope of all segments (slope = 1.4). After the first segment, the mean MOS continued to increase over segments two, three, and four with slopes of 6.7, 14.9, and 9.4, respectively. Of all participants, participant number 2 showed the strongest floor effect: the MOS increased only after the second segment (PSM > 0.76). In addition to the floor effect, participant number 1 showed a strong ceiling effect: the MOS equaled 5 in the fourth segment (PSM > 0.88).

Discussion and conclusions
The results of our study indicate that, with musical noise distortions added as specified in Section 2, the ten identified PSM values cover the whole available listening effort scale from "extreme effort" to "no effort". The proposed method is therefore suitable for systematically introducing musical noise distortions with perceptual consequences covering the entire listening effort range. The data indicated that a difference of 1 ESCU corresponded to a PSM change of 0.03. We therefore propose this value as a minimum step size in future studies employing this procedure. It is important to note that this step size refers to the specific conditions used in this evaluation study and may differ for other conditions.
The participants' listening effort ratings closely followed the LEAP model predictions, indicating that LEAP is applicable to quantitatively predicting the perceptual impact of this type of distortion. This is notable because the model was not adapted in any way, which suggests that it may be useful for predicting the perceptual impacts of a wide range of speech disturbances [21][22][23].
With respect to sound quality ratings, the ten identified PSM values cover the whole available sound quality scale from "bad" to "excellent". In other words, the proposed method is also suitable to produce all quality categories that may be required in, e.g., system evaluation studies. The data show a strong floor effect, i.e., sound quality ratings are on average insensitive to PSM values between 0.52 and 0.64. This floor effect was already observed for audio codec distortions in [4]. We therefore suggest considering PSM values greater than 0.64 for sound quality ratings. However, two individual data sets showed strong deviations from the mean, i.e., floor and/or ceiling effects could be observed even within the suggested PSM value range. It is important to note that four participants rated the undistorted sentences (i.e., PSM = 1) with a MOS smaller than the mean value (<4.9). A possible reason could be that the signals had a limited bandwidth (sampling frequency of 16 kHz), which the normal-hearing participants no longer rated as "excellent". With its necessary simplifications, the study has inherent limitations. A specific set of conditions was used in the evaluation: distorted OLSA sentences presented at a medium-loud level. To what extent the results generalize to different speech material (e.g., effortful speech) or different presentation levels (e.g., loud levels) is unclear. Furthermore, only young normal-hearing participants were invited to the experiment, so the results are not directly applicable to, e.g., hearing-impaired persons. To what extent a hearing loss and/or amplification influences the ratings of musical noise distortions should be the subject of future research.