Issue
Acta Acust.
Volume 8, 2024
Topical Issue - Active Noise and Vibration Control
Article Number 39
Number of page(s) 14
DOI https://doi.org/10.1051/aacus/2024051
Published online 20 September 2024

© The Author(s), Published by EDP Sciences, 2024

Licence: Creative Commons. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1 Introduction

Active Noise Control (ANC) technology, which aims to cancel undesired acoustic noise by generating an appropriate anti-noise signal, i.e., a signal with equal amplitude but inverted phase with respect to the noise so that the two interfere destructively, has been around for many decades. Only recently, however, has it been applied to headphones, the most popular personal listening devices, a market that is expected to grow further in the coming years.

The most popular ANC applications utilize the Filtered-x Least Mean Squares (FxLMS) algorithm and its extensions [1], which offer high performance, a simple structure, robust operation and low computational complexity [2–4]. Driven by the fact that noise degrades the physiological and psychological health of humans [5], such algorithms attempt to reduce the total noise power at the location of the listener by minimizing the error computed from the signals captured by the reference microphones, commonly located on the outside of the headphone shell.

A common trait of such algorithms is that they largely ignore the spatial sensitivity of the system [6] by attempting to minimize the total external noise. Several schemes have been proposed to alleviate this shortcoming by implementing a beamformer alongside the ANC system or hearing-aid application [7–9], with the goal of achieving a “transparency” effect, where the user is able to hear specific external sounds as if the headphones were not worn, without actually removing them. An adaptive-gain algorithm for fixed pre-trained filters from different directions of arrival was proposed in [10], while a cascaded biquad filter structure proposed in [11] addresses the feedforward controller design problem with respect to multiple incident directions of the noise.

A Targeted through Beamforming ANC (TBANC) approach has been proposed in [12], utilizing a Time-Domain Beamformer (TDBF) in order to attenuate a specific noise source. While this method was able to successfully attenuate the background noise field to acceptable levels, it was limited to single source scenarios. In a previous study [13], the TBANC approach was extended to handle multi-source scenarios with diffuse noise, referred to as TBANC-D.

In this work, the TBANC-D approach is further developed by the addition of a Smart Cognitive Disturbance Module (SCDM), that decides which source is the most disturbing for the listener, based on a novel Disturbance metric that determines the severity of distinct noises in a complex sound field. The Disturbance metric was derived from a listening test procedure that is an enhanced version of the one conducted in [13]. In this procedure, the assessors evaluated the evoked disturbance of sources based on the nature of the source, as well as its relative position in space with respect to the listener. The SCDM drives the operation of the TBANC-D system, by steering the TDBF to the source deemed to be the most disturbing, using information about the auditory scene provided by a Sound Event Localization and Detection (SELD) model. Based on a multi-reference control strategy [14] shown to outperform other state-of-the-art ANC approaches, the ANC system is tested in multiple scenarios through computer simulations, in order to evaluate its performance and robustness.

The paper is structured as follows: Section 2 formulates the problem; the employed TDBF is described along with the original multi-reference ANC strategy, the proposed ANC approach and the SELD Neural Network. The experimental setup, the listening test methodology and the sound-field simulation approach are also presented in this section. Section 3 reports the results of the listening test, along with statistical analyses that provide further understanding on the evoked disturbance of each sound source, and presents the results of the evaluation of the proposed ANC method, in comparison to the baseline multi-reference ANC algorithm in terms of convergence speed, steady state error as well as its robustness to beamformer steering error. Finally, Section 4 presents the conclusions from this work.

2 Methods

The ANC methodology presented in this study is founded on the idea that, within a complex acoustic environment, there can be a distinct sound source that is prevalent and disruptive, and is thereby linked with a higher Disturbance metric. This is termed the primary noise source.

In real-world applications, the existence of such a source can be detected using a straightforward differential energy detector. That is, when the noise level from one microphone surpasses a specific predefined threshold, as detailed in [12], the hemiplane where the source resides can be identified.

In the suggested method, the goal of the ANC controller is to attenuate the entire sound field, with a particular emphasis on the primary noise source. For this purpose, the Multi-Reference ANC (MRFANC) [14] was enhanced with a beamformer, which in this study is implemented as a simple TDBF.

Moreover, the multi-reference approach practically enables the combination of two anti-noise signals. One is mainly based on the primary noise with the intention of further reducing such an unwanted stimulus, and the other is effectively based on the diffuse sound field.

The suggested approach comprises a SELD machine learning model, a TDBF, and a MRFANC controller [14]. The operation of this system, termed TBANC-D, is guided by a SCDM which utilizes a Disturbance metric, derived from a subjective listening test, to identify the most disruptive sound source for the listener.

The Disturbance metric gauges the intensity of various noises within a complex sound field. It takes into account not only the nature of the sound source but also its spatial position relative to the listener.

An overview of the proposed system is shown in Figure 1. The signals captured by the left and right microphones are used to extract the features that are fed into the SELD model. The SELD model then outputs the Sound Event Detection (SED) estimates, indicating which source classes are active in the current scene, as well as the Direction of Arrival Estimates (DOAE), indicating the direction of arrival of each source. Using the auditory-scene information provided by the SELD model, the SCDM outputs the DOA of the primary disturbing source p̂, which directs the operation of the TBANC-D by steering the TDBF towards the source identified as the most disruptive.

Figure 1

Block diagram of the proposed TBANC-D system. p̂ denotes the DOA of the primary disturbing source as determined by the SCDM.

2.1 Problem formulation

This study considers a complex sound field where the headphone user is exposed to a combination of diffuse, spatially white noise and two noise sources: a primary noise source p(n) that is particularly disturbing to the listener, and a secondary noise source s(n). The setup is illustrated in Figure 2. In this scenario, the right ear, which is closer to the primary noise source, is assumed to be the “Good-Ear” as it captures a higher energy level of both p(n) and s(n) compared to the left ear.

Figure 2

Schematic of the ANC setup formulation. The primary noise source p(n) is denoted by a red speaker with its respective DOA given by p̂, while the secondary source s(n) is denoted by a blue speaker. Broadband white noise sources, denoted by the gray arrows, are placed with a 5° spacing in order to simulate a diffuse noise field.

The signals captured by the two reference microphones mL(n) and mR(n), particularly the signal from the Bad-Ear, are significantly influenced by the head shadowing effect. This effect is determined by the angle of incidence of the sound source [15]. The head shadowing effect can be calculated as the Relative Transfer Function between the two Head Related Transfer Function (HRTF) channels, given by

$$ \mathrm{HS}(\omega,\theta)=\frac{\mathrm{HRTF}_{\mathrm{BE}}(\omega,\theta)}{\mathrm{HRTF}_{\mathrm{GE}}(\omega,\theta)} \tag{1} $$

where θ indicates the sound angle of incidence, HRTFGE and HRTFBE indicate the transfer functions of the Good-Ear and Bad-Ear, respectively.

The noise captured by the reference microphone at the side of the Good-Ear due to a noise source emitting a signal ξ(n) (primary or diffuse) arrives unmodified, with a simple delay. On the other hand, the noise captured by the microphone at the side of the Bad-Ear is affected by the respective Head-Shadow filter calculated by equation (1). The resulting signals are then given by

$$ \begin{aligned} m_{\mathrm{GE}}(n) &= \xi(n-N) \\ m_{\mathrm{BE}}(n) &= m_{\mathrm{GE}}(n)\times \mathrm{HS}(\omega,\theta) \end{aligned} \tag{2} $$

where mGE and mBE denote the signals captured by the Good-Ear and Bad-Ear, respectively, and ξ(n − N) denotes a noise source placed at an angle θ in the horizontal plane, arriving with a delay N due to its distance from the Good-Ear. Depending on the angle θ of the noise source, the role of the Good-Ear is assumed by the right ear for −180° ≤ θ ≤ 0° or by the left ear for 0° < θ < 180°, where 0° represents the angle directly in front of the listener and θ increases in a counter-clockwise fashion.
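As a concrete illustration of equation (2), the following sketch simulates the Good-Ear and Bad-Ear reference signals for a single source. The head-shadow filter is represented here by a short, purely illustrative FIR `hs_ir` (an assumption for demonstration, not a measured HS(ω, θ)):

```python
import numpy as np

def simulate_ear_signals(xi, delay_n, hs_ir):
    """Sketch of equation (2): the Good-Ear receives the source with a
    pure delay of `delay_n` samples, while the Bad-Ear signal is the
    Good-Ear signal filtered by a head-shadow impulse response."""
    m_ge = np.concatenate([np.zeros(delay_n), xi])[:len(xi)]  # xi(n - N)
    m_be = np.convolve(m_ge, hs_ir)[:len(xi)]                 # head-shadowed copy
    return m_ge, m_be

rng = np.random.default_rng(0)
xi = rng.standard_normal(1024)        # broadband noise source
hs_ir = np.array([0.5, 0.2, 0.1])     # toy attenuating "head-shadow" FIR (assumption)
m_ge, m_be = simulate_ear_signals(xi, delay_n=8, hs_ir=hs_ir)
# For this attenuating filter, the Bad-Ear carries less energy than the Good-Ear.
```

Since the toy filter's gain never exceeds unity, the Bad-Ear energy is guaranteed to be lower, mimicking the shadowing effect described above.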

2.2 Beamformer

A simple TDBF that performs delay-and-sum beamforming is employed in this work [16]. Plane-wave signals arriving at the phased array elements are time-aligned and then summed. The time delay applied is dependent upon the phased array element spacing, as well as the direction the beamformer is chosen to be steered at and is given by:

$$ \tau=\frac{d_{\mathrm{mic}}\,\sin(\theta)}{c} \tag{3} $$

where τ is the time delay in seconds, dmic is the distance between the two array elements, c is the speed of sound in air and θ is the direction of the incoming signal with respect to the array element, i.e. the location of the primary noise source as described in Section 2.1. The phased sensor array is formed between the two reference microphones traditionally placed on the outer shell of an ANC enabled headphone. Here, the distance between the two array elements is assumed to be dmic = 14.5 cm which is the average head width of an adult (bitragion breadth) plus an additional 1 cm to account for the typical size of the earcups and the position of the two microphones on them.

The output of the beamformer for the case where the primary noise source is located as shown in Figure 2 is then given by

$$ BF_{\mathrm{o}}(n)=m_{\mathrm{L}}(n-\tau_{\mathrm{L}}F_{\mathrm{s}})+m_{\mathrm{R}}(n-\tau_{\mathrm{R}}F_{\mathrm{s}}) \tag{4} $$

where BFo(n) is the beamformed signal, mL and mR the signals received at the Left and Right reference microphones respectively including the head-shadow effect as described by equation (2), τL and τR correspond to the time delays applied to the signals at each array element and Fs the chosen sampling rate.
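A minimal sketch of the delay-and-sum operation of equations (3) and (4), assuming a two-element array and rounding the steering delay to whole samples (a practical implementation would use fractional-delay filtering); the convention for which channel is delayed is an assumption:

```python
import numpy as np

def delay_and_sum(m_l, m_r, theta_deg, d_mic=0.145, c=343.0, fs=24000):
    """Two-element delay-and-sum beamformer sketch (equations (3)-(4)).
    The lagging channel is advanced so both are time-aligned, then summed.
    Delays are rounded to whole samples for simplicity."""
    tau = d_mic * np.sin(np.deg2rad(theta_deg)) / c   # inter-mic delay, seconds
    n = int(round(abs(tau) * fs))                     # delay in samples
    if tau >= 0:   # source closer to the right microphone (assumed convention)
        m_l_al = m_l
        m_r_al = np.concatenate([np.zeros(n), m_r])[:len(m_r)]
    else:          # source closer to the left microphone
        m_l_al = np.concatenate([np.zeros(n), m_l])[:len(m_l)]
        m_r_al = m_r
    return m_l_al + m_r_al
```

For a source at ±90°, the applied delay equals dmic/c, consistent with the maximum delay used in the causality analysis of Section 2.9.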

As can be observed in Figure 3, which shows directivity patterns of the beamformer for different frequencies and steering angles, this beamformer displays quite a large side-lobe level, and beam width, due to the simple two microphone arrangement it utilizes, a combination which normally leads to problems in the presence of multiple sources. This beamformer was chosen however, due to its simple implementation and low computational cost (which is especially important to preserve the causality criteria as described in Sect. 2.9), even though a more sophisticated approach, such as the one proposed by [17], an adaptive solution [18], or a superdirective beamformer [19] can be chosen in order to significantly improve directivity of the system.

Figure 3

Beam patterns for (blue): 0.5 kHz; (red): 1 kHz; (yellow): 3 kHz; (a): −30° steering angle; (b): −75° steering angle.

2.3 Smart cognitive disturbance module

The SCDM aims to identify the primary noise source within a complex sound field, utilizing the Disturbance metrics associated with different types of spatialized noises, as determined by human assessors through a listening test, and then provide the TBANC with the angle of the primary noise source p̂.

The listening test panel consisted of 18 assessors, all self-reported as normal hearing. The average age of the participants was 24.3 years with a standard deviation of 7.6 and 9 participants were female. From each monophonic recording described in Section 2.5, one representative 10 s noise excerpt was selected, such that the content of each excerpt would be similar across its entire duration. The noise excerpts were spatialized through binaural synthesis, employing the Head Related Impulse Responses (HRIRs) of the KU100 mannequin of the SADIE II database [20]. Considering a left – right symmetry around a human listener in the horizontal plane, 7 binaural signals were synthesized for each noise excerpt, as is shown in Figure 4, at azimuth angles of 0°, −30°, 60°, −90°, 120°, −150° and 180° around the listener. The obtained binaural noise excerpts were loudness normalized at −22 LUFS according to [21]. The binaural noise excerpts were auralized via Ultrasone Pro 650 headphones and a custom graphical interface was employed for registering the assessments of the listeners, based on webMUSHRA [22]. During the listening procedure, the assessors were asked to rate the level of disturbance evoked by each noise sample. This rating corresponded to the Disturbance metric and was obtained by answering the question “How disturbing is each noise sample?”. For this assessment, a 9-point differential scale was used [23], with the upper limit representing high disturbance and the lower limit representing low disturbance. Additionally, a multi-stimulus procedure was employed, wherein the seven spatialized samples of each noise were presented on every page of the assessment graphical interface.

Figure 4

Schematic representation of the azimuth angles of the binaurally synthesized noise excerpts.

2.4 Neural network features

The SELD model employed in this work was responsible for the detection and localization of sound events in the auditory scene, in order to provide the SCDM with the information required to guide the TDBF to the most disturbing source.

The Spatial Cue-Augmented Log-Spectrogram for Polyphonic SELD (SALSA) feature [24] was employed in this work. This feature was chosen due to the high performance achieved by SELD models in the DCASE 2022 challenge utilizing SALSA for microphone recording inputs, as is also the case for this work, where sources have to be detected and localized through the use of signals captured by the left and right reference microphones mGE and mBE. Specifically, a more computationally efficient version of the SALSA feature, called SALSA-Lite [25] was used.

The 2M − 1 channel SALSA-Lite feature comprises the M log-spectrogram channels computed from the M reference microphones and the M − 1 channels of frequency-normalized inter-channel phase differences (NIPD) [26], where M is the number of reference microphones. The NIPD is given by

$$ \mathbf{\Lambda}(t,f)=-\frac{c}{2\pi f}\,\arg\!\left[X_{1}^{*}(t,f)\,\mathbf{X}_{2:M}(t,f)\right] \tag{5} $$

where t and f are the time and frequency indices, c is the speed of sound in air and Xi(t, f) is the short-time Fourier transform (STFT) of the ith microphone signal.

The SALSA-Lite feature, which has an exact time-frequency alignment between the log-spectrograms and the NIPD channels, has been proven to be effective in resolving overlapping sound events [25]. This characteristic is particularly relevant for this work, as it deals with scenarios where sound events overlap.
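The SALSA-Lite construction described above can be sketched as follows, assuming two microphone channels and one-sided STFTs; the `eps` floor and the DC-bin handling are implementation assumptions, not details from [25]:

```python
import numpy as np

def salsa_lite_features(stfts, c=343.0, fs=24000, n_fft=512):
    """Sketch of the 2M-1 channel SALSA-Lite feature (equation (5)):
    M log-power spectrograms plus M-1 frequency-normalized inter-channel
    phase difference (NIPD) channels, with channel 0 as the reference.
    `stfts` is an (M, T, F) complex array of one-sided STFT frames."""
    eps = 1e-8
    log_spec = np.log(np.abs(stfts) ** 2 + eps)        # M log-power channels
    freqs = np.arange(stfts.shape[-1]) * fs / n_fft    # bin centre frequencies
    freqs[0] = freqs[1]                                # avoid divide-by-zero at DC
    phase = np.angle(np.conj(stfts[0:1]) * stfts[1:])  # inter-channel phase (M-1)
    nipd = -c / (2 * np.pi * freqs) * phase            # frequency-normalized IPD
    return np.concatenate([log_spec, nipd], axis=0)    # (2M-1, T, F)
```

With M = 2 reference microphones, as in this work, the feature has three channels: two log-spectrograms and one NIPD map.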

2.5 Dataset

The dataset used to train the SELD model consisted of monophonic recordings from the DEMAND dataset [27], as shown in Table 1, that were spatialized using the method described by equation (2).

Table 1

Selected recordings, representative of the environments in which the proposed system would be used.

The spatialized recordings consisted of two simultaneously active sources, each placed at a different azimuth angle on the right hemiplane, with a step of 5° and a minimum spacing of 30°. The first 30 s of each recording were used to generate the spatialized recordings for the training stage, while the following 10 s were used for the inference stage, resulting in 36 h of data for training and 12 h for inference. These recordings consist of the signals captured by the left and right reference microphones mGE and mBE and are sampled at 24 kHz for the purposes of training the SELD network. Only cases for the right ear were considered, since resolving the hemiplane where the sources reside is straightforward using a simple energy-based approach, as described in [12]. The case where the two sources reside on opposite hemiplanes is left for future work.

To further enhance the available dataset, two data augmentation techniques were applied to all the features during training: Random Cutout (RC) [28, 29] and Frequency Shifting (FS) [24]. In RC either random cutout or Time-Frequency (TF) masking via SpecAugment [28] was applied on all channels of the input features, by producing a rectangular or a cross-shaped mask on the spectrograms respectively. For FS the input features were randomly shifted up or down by up to 10 bands.
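A minimal sketch of the two augmentations, under simplifying assumptions (a fixed 16 × 16 rectangular cutout patch and zero-fill for shifted-out frequency bins); the mask sizes and fill values are illustrative choices, not taken from the cited works:

```python
import numpy as np

def frequency_shift(feat, max_bands=10, rng=None):
    """FS augmentation sketch: shift all feature channels up or down along
    the frequency axis by up to `max_bands` bins, zeroing vacated bins."""
    if rng is None:
        rng = np.random.default_rng()
    shift = int(rng.integers(-max_bands, max_bands + 1))
    out = np.roll(feat, shift, axis=-1)
    if shift > 0:
        out[..., :shift] = 0.0
    elif shift < 0:
        out[..., shift:] = 0.0
    return out

def random_cutout(feat, max_h=16, max_w=16, rng=None):
    """RC augmentation sketch (rectangular-mask variant): zero a random
    time-frequency patch on all channels of a (C, T, F) feature array."""
    if rng is None:
        rng = np.random.default_rng()
    t0 = int(rng.integers(0, feat.shape[-2] - max_h))
    f0 = int(rng.integers(0, feat.shape[-1] - max_w))
    out = feat.copy()
    out[..., t0:t0 + max_h, f0:f0 + max_w] = 0.0
    return out
```

Both operations act on all channels jointly, as described above, so the spatial relationship between the log-spectrogram and NIPD channels is preserved.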

2.6 Network architecture

The Convolutional Recurrent Neural Network (CRNN) employed in this work is based on the SELD network of [24], modified to suit the requirements of this work, and is shown in Figure 5. The network consists of an encoder block [30] followed by two branches: the SED branch is formulated as a multiclass, multilabel classification problem, while the DOAE branch is formulated as a one-dimensional regression problem.

Figure 5

Block diagram of the SELD Network. Image adapted from [24].

2.7 Hyperparameters

The STFT window length was set to 512 samples, with a hop size of 300 samples, a Hann window and 512 FFT points. A cutoff frequency of 9 kHz was selected to compute the features resulting in 192 bins for the SALSA-Lite features. Eight-second audio chunks were used during training, with an overlap of 0.5 s while the full audio clip was used during the inference stage. The Adam optimizer [31] was used with a learning rate of 0.0003 and the network was trained for 2 epochs with a batch size of 16.

2.8 Targeted beamforming ANC

The block diagram for the right-ear controller of TBANC incorporating the proposed beamforming scheme is shown in Figure 6. The signal fed to the control loudspeaker can be expressed as the summation of the control signals generated by filtering the right and left reference signals mR and mL, respectively. Without loss of generality, the anti-noise signal driving the right loudspeaker is given by

$$ \begin{aligned} y'_{\mathrm{R}}(n) &= y_{\mathrm{RR}}(n)+y_{\mathrm{LR}}(n) \\ &= w_{\mathrm{RR}}(n) * BF_{\mathrm{o}}(n) + w_{\mathrm{RL}}(n) * m_{\mathrm{L}}(n) \end{aligned} \tag{6} $$

where * denotes the convolution operation and BFo(n) is the beamformer output, which drives the adaptive filter wRR. wRR and wRL are the weights of the adaptive control filters, with wRR corresponding to the filter whose input comes from the right reference microphone and wRL to the filter whose input comes from the left reference microphone; both are used to drive the right-ear error sensor. The weight update procedure of the Normalized LMS controller can be expressed by:

$$ \begin{aligned} w_{\mathrm{RR}}(n+1) &= w_{\mathrm{RR}}(n)+\mu\,\frac{e_{\mathrm{R}}(n)\,x'_{\mathrm{RR}}(n)}{r+\lVert x'_{\mathrm{RR}}(n)\rVert^{2}} \\ w_{\mathrm{RL}}(n+1) &= w_{\mathrm{RL}}(n)+\mu\,\frac{e_{\mathrm{R}}(n)\,x'_{\mathrm{RL}}(n)}{r+\lVert x'_{\mathrm{RL}}(n)\rVert^{2}} \end{aligned} \tag{7} $$

where μ is the step size, r is the regularization factor, eR(n) is the error captured by the error microphone inside the right ear cup, and x′RR(n) and x′RL(n) are the noise signals captured by the right and left reference microphones, respectively, each filtered by the estimate of the secondary path S′(z). The error signal is given by
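To illustrate the normalized-LMS update of equation (7), the following single-channel sketch identifies a toy primary path. The secondary path is taken as an identity (so the filtered reference x′ equals the reference itself) and the anti-noise sign is folded into the error; both are simplifying assumptions, not the full two-filter structure above:

```python
import numpy as np

# Minimal single-channel normalized-LMS sketch of the update in equation (7).
rng = np.random.default_rng(0)
L = 32                                  # adaptive-filter length (assumption)
mu, r = 0.5, 1e-6                       # step size and regularization factor
p = rng.standard_normal(L) * np.exp(-np.arange(L) / 4)  # toy primary path
w = np.zeros(L)                         # adaptive control-filter weights
x = rng.standard_normal(4000)           # white reference signal (S'(z) = 1)
err = []
for n in range(L, len(x)):
    xb = x[n - L:n][::-1]               # reference block, newest sample first
    e = np.dot(p, xb) - np.dot(w, xb)   # residual after anti-noise
    w = w + mu * e * xb / (r + np.dot(xb, xb))  # normalized-LMS update
    err.append(e)
err = np.asarray(err)
# The residual error decays as the filter converges toward the primary path.
```

With an exact-order filter and persistent white excitation, the weights converge to the toy path and the residual error vanishes.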

$$ e_{\mathrm{R}}(n)=a_{\mathrm{R}}(n)+y'_{\mathrm{R}}(n) * S_{\mathrm{R}}(n) \tag{8} $$

where aR(n) is the total ambient noise captured at the right error microphone, attenuated by the headphone shell, y′R(n) is the control signal driving the right headphone driver and SR(n) is the impulse response corresponding to the right headphone driver.

Figure 6

Right ear controller overview of the proposed targeted ANC with beamforming. p̂ denotes the DOA of the primary disturbing source. An equivalent control algorithm operates independently for the left ear.

For the subsequent analysis, a typical example case is examined, and it is assumed that both the primary and secondary noise sources reside to the right of the listener in the horizontal plane, so the right-ear is chosen as the Good-Ear. This signifies that due to the head shadowing effect, the right reference microphone signal mR(n) contains significantly more energy than the respective left reference microphone signal mL(n). In order to further emphasize this, the output of the TDBF BFo, is used to drive the adaptive filter wRR and its respective controller.

2.9 Causality

The causality of ANC systems employing the FxLMS algorithm is a well-studied subject [32–34], especially in the context of the performance of the system. As per Figure 1, the reference signal to the controller of the right ear is the beamformed signal BFo, which introduces a delay in the system and as such warrants further investigation.

The causality criterion for the FxNLMS system according to [32] can be expressed as

$$ \delta=\Delta\mathrm{EL}-\Delta P<0 \tag{9} $$

where ΔEL is the total electrical delay introduced by the processing performed on the secondary path and ΔP is the delay introduced by the primary path P(z). When this condition is violated, the system is considered non-causal, meaning that the adaptive filter W(z) would have to act as a prediction filter. As mentioned in [32], the delay introduced by the adaptive filter itself can be ignored without loss of generality when a sufficiently small learning rate is used, so the electrical delay ΔEL can be expressed as

$$ \Delta\mathrm{EL}=\Delta\mathrm{BF}+\Delta S \tag{10} $$

where ΔBF is the delay introduced by the beamformer and ΔS is the delay introduced by the secondary path driver S(z).

In order to simplify the analysis, the delay introduced by the beamformer ΔBF is assumed to be the maximum delay possible, denoted by ΔBFmax, which is the delay introduced by the TDBF when the primary noise source is collinear with the two reference microphones, i.e., when the source resides at ±90°.

The primary and secondary path responses were measured in a semi-anechoic chamber employing the methodology described in [35]. The group delays of the paths are shown in Figure 7; the average group delays were found to be ΔS = 2.88 × 10−4 s and ΔP = 7.13 × 10−4 s, respectively.

Figure 7

Primary (top) and secondary (bottom) path group delay.

The maximum TDBF delay is given by

$$ \Delta\mathrm{BF}_{\mathrm{max}}=\frac{d_{\mathrm{mic}}}{c}=\frac{0.145}{343}\approx 4.23\times 10^{-4}\,\mathrm{s} \tag{11} $$

where dmic is the distance between the two reference microphones as described in Section 2.2 and c is the speed of sound in air. The causality criterion is then given by

$$ \delta=\Delta\mathrm{BF}+\Delta S-\Delta P \tag{12a} $$

$$ \delta=4.23\times 10^{-4}+2.88\times 10^{-4}-7.13\times 10^{-4} \tag{12b} $$

$$ \delta=-0.02\times 10^{-4}\,\mathrm{s}<0 \tag{12c} $$

which indicates that the system is causal, even at the maximum delay introduced by the TDBF. Since the group delay is a function of frequency for P(z) and S(z), equation (12a) would have to be evaluated over the entire frequency range. However, as can be seen in Figure 7, the group delay of S(z) never exceeds that of P(z), so the system is causal for all frequencies and for all source positions.
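The causality check of equations (11)–(12) can be reproduced numerically. The sketch below uses the average group delays quoted above, with the beamformer delay ΔBF = dmic |sin θ| / c reaching its maximum for a source collinear with the microphones (θ = ±90°):

```python
import numpy as np

def causality_margin(theta_deg, d_mic=0.145, c=343.0,
                     delta_s=2.88e-4, delta_p=7.13e-4):
    """Causality margin delta of equation (12a) for a source at
    `theta_deg`, using the average measured group delays quoted in the
    text. Negative values mean the system is causal."""
    delta_bf = d_mic * abs(np.sin(np.deg2rad(theta_deg))) / c
    return delta_bf + delta_s - delta_p

# Worst case: source collinear with the two reference microphones.
margin = causality_margin(90.0)
# margin is slightly negative, matching equation (12c).
```

Any other source angle yields a smaller ΔBF and hence an even larger (safer) causality margin.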

It should be noted that, should an application require a higher causality margin δ, this would be straightforward to implement by making the headphone enclosure slightly larger to increase the microphone distance dmic.

2.10 Evaluation metrics

The performance of the SELD model for the SED task was evaluated using the metrics of Precision, Recall and F1-score given by

$$ \begin{aligned} \mathrm{Precision} &= \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}} \\ \mathrm{Recall} &= \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}} \\ \mathrm{F1\text{-}score} &= \frac{2\times\mathrm{Precision}\times\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}} \end{aligned} \tag{13} $$

where TP denotes the True-Positive classifications, FP denotes the False-Positive classifications and FN denotes the False-Negative classifications respectively.
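Equation (13) can be computed per class from binary multilabel SED outputs as follows; the frame-by-class matrix representation is an assumed convention:

```python
import numpy as np

def sed_metrics(y_true, y_pred):
    """Per-class Precision, Recall and F1-score (equation (13)) for
    binary multilabel SED targets of shape (frames, classes)."""
    tp = np.sum((y_true == 1) & (y_pred == 1), axis=0)  # true positives
    fp = np.sum((y_true == 0) & (y_pred == 1), axis=0)  # false positives
    fn = np.sum((y_true == 1) & (y_pred == 0), axis=0)  # false negatives
    precision = tp / np.maximum(tp + fp, 1)
    recall = tp / np.maximum(tp + fn, 1)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    return precision, recall, f1
```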

The Mean Absolute Error (MAE) used to evaluate the performance of the DOAE task is given by

$$ \mathrm{MAE}=\frac{1}{N}\sum_{i=1}^{N}\left|\widehat{\theta}_{i}-\theta_{i}\right| \tag{14} $$

where θ̂i denotes the predicted DOA and θi denotes the ground-truth DOA.

High Precision, Recall and F1-score values indicate that the model is able to correctly classify the events, while low MAE values indicate that the model is able to accurately predict the DOA of the events.

The performance of the ANC is evaluated using the Ie metric, which is given by

$$ I_{e}=10\log_{10}\!\left(\frac{\sum_{n=0}^{N} e_{\mathrm{MRFANC}}^{2}(n)}{\sum_{n=0}^{N} e_{\mathrm{PROP}}^{2}(n)}\right) \tag{15} $$

where Ie denotes the improvement in dB calculated as the ratio between the steady-state errors of the two methods, namely the proposed TBANC and the traditional MRFANC approach, i.e., when no beamformer or source targeting system is active. To guarantee that both approaches have already converged, the last 20 s of the simulations are used in the above calculations.
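A small sketch of the Ie computation of equation (15), taking the final `tail_s` seconds of both residual-error signals; the 20 s window follows the text, while the sampling rate `fs` is a parameter:

```python
import numpy as np

def improvement_db(e_baseline, e_proposed, fs=24000, tail_s=20.0):
    """Steady-state improvement I_e of equation (15): the dB ratio of the
    residual-error energies of the baseline (MRFANC) and the proposed
    method over the final `tail_s` seconds of a simulation."""
    n = int(tail_s * fs)
    num = np.sum(np.asarray(e_baseline[-n:]) ** 2)
    den = np.sum(np.asarray(e_proposed[-n:]) ** 2)
    return 10 * np.log10(num / den)
```

A positive Ie means the proposed method leaves less residual error than the baseline over the evaluated window.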

3 Results

3.1 Disturbance metric

The disturbance ratings obtained from the listening test, described in Section 2.3, were statistically analyzed to determine whether the type and azimuth angle of the noise source had an effect on the evoked disturbance of the listeners, as well as to examine interaction effects between these two variables. To this end, a two-way repeated-measures Analysis of Variance (ANOVA) was performed on the obtained data, followed by post hoc paired t-tests with Bonferroni correction for each variable, in order to investigate which types and azimuth angles of noise evoked higher disturbance in the assessors. Although the normality assumption was violated for a few combinations of noise type and angle, non-parametric statistical tests [36] cannot evaluate interaction effects, so the parametric analysis was retained.

It was found that there were statistically significant differences in the means of the three types of noise (F(2, 34) = 5.825; p = .007) and in the means of the seven different azimuth angles (F(6, 102) = 21.707; p < .001). No statistical interaction effect was found (F(12, 204) = 1.001; p = .449) between azimuth angles and noise types as can be seen in Figure 8 by the almost parallel dotted lines.

Figure 8

Mean values and 95% confidence intervals of the Disturbance metric for each type and azimuth angle of spatialized noises. The mean values of each type of noise are interpolated by the respective dotted lines.

The post-hoc analysis resulted in the following disturbance ranking with respect to source type: 1) traffic, 2) café, 3) subway station. More information about the statistical analysis can be found in the Appendix, along with the analysis performed for the different azimuth angles of the assessed noises.

In order to provide a compact way of presenting the effect of the different azimuth angles on the evoked disturbance, Figure 8 illustrates the mean values and the 95% confidence intervals of the Disturbance metric for each type and azimuth angle of the assessed noises. It is evident that a higher level of disturbance is evoked by sources residing at the side of the listener (60° ≤ |θ| ≤ 120°), with the highest Disturbance metric observed at |θ| = 90°. Furthermore, it appears that noises located behind the assessors (120° ≤ |θ| ≤ 180°) evoked a lesser disturbance compared to those located in front of them (0° ≤ |θ| ≤ 60°). These visual observations were also validated by the respective statistics presented in the Appendix.

3.2 Sound event detection

The results for the SED task are shown in Table 2 for each of the three classes, along with the average precision, recall, and F1-score. High values of the Precision metric for a sound event class indicate that only a few sound events belonging to another class were incorrectly assigned to this class, whereas high values of the Recall metric indicate that only a small portion of sound events belonging to this class were incorrectly assigned to a different class. The results show that the proposed method achieves a high precision and recall for all classes, with an average precision of 82% and recall of 89%. The lower precision for the café class shows that some sound events belonging to other classes were assigned to it, perhaps because strong gusts of wind are also present in the respective recordings, making the café class more difficult to detect. Regarding the traffic class, the lower Recall value indicates that some sound events from this class were not detected and were incorrectly assigned to a different class.

Table 2

Classification results for the SED task. The precision, recall, and F1-score are reported for each class along with their respective averages.

Table A1

Statistical analysis results of the Azimuth Angle variable. The symbol * indicates whether the means of the respective groups have statistically significant differences.

3.3 Direction of arrival estimation

In this section, the results for the DOAE task are presented, given a correct SED classification. The achieved MAE is shown in Figure 9 for the three Signal-to-Noise Ratio (SNR) cases evaluated in this work.

Figure 9

MAE for the DOA estimation task for the different classes averaged across SNRs: (a) café; (b) traffic; (c) subway station.

The proposed method achieves a low MAE for all classes, with a total average of 0.34°. The average MAE achieved for the café, traffic and subway station classes was 0.72°, 0.20° and 0.08°, respectively. The larger error at 90°, especially when SNR = −10 dB, can be largely attributed to the fact that at this angle the noise source is collinear with the two microphones positioned on the headphone shell; the inter-channel differences in the features used for DOA estimation (as described in Sect. 2.4), mainly the IPD, therefore become smaller, making the estimation problem harder. The effect that a potential source mistargeting might have on the performance of the TBANC system has been studied in [12].

It should be noted that, while it is generally not possible to resolve front-back confusion using only two microphones (as can also be observed from the front-back symmetry in Fig. 3), the proposed SELD model is likely able to exploit the head-shadowing effect, which affects the two reference microphone recordings in an angle-dependent manner, thus providing a way to resolve the front-back confusion.
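The IPD feature referred to above is, in essence, the phase difference between the two reference-microphone signals at a given frequency bin. As a rough illustration only (a naive DFT; the frame length, windowing and the exact feature pipeline of Sect. 2.4 are not reproduced here):

```python
import cmath
import math

def ipd(frame_l, frame_r, k):
    """Interaural phase difference at DFT bin k between two
    equal-length time-domain frames, wrapped to (-pi, pi]."""
    n = len(frame_l)

    def dft_bin(x):
        # Naive single-bin DFT; illustrative, not efficient.
        return sum(x[t] * cmath.exp(-2j * math.pi * k * t / n)
                   for t in range(n))

    phi = cmath.phase(dft_bin(frame_l)) - cmath.phase(dft_bin(frame_r))
    # wrap the difference back into (-pi, pi]
    return math.atan2(math.sin(phi), math.cos(phi))
```

A source at 90° (co-axial with the two microphones) produces near-identical phases at both microphones, which is consistent with the larger DOA errors reported at that angle.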

3.4 Convergence speed

Due to the differences between the two signals, BFo(n) and mL(n), used to drive the respective adaptive filters, the convergence speed of the proposed scheme is negatively affected, as can be seen in Figure 10: the proposed scheme converges to the steady-state error after 2.5 s, whereas the original MRFANC algorithm converges after 1.3 s. The primary-to-diffuse SNR had no effect on the convergence speed of the algorithms, as the comparison between two different SNR levels in Figure 10 shows; the convergence-time difference between the two algorithms remains the same in both cases.
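Convergence times such as the 2.5 s and 1.3 s quoted above can be read off the error curves of Figure 10. One common way to extract them programmatically, sketched here under an assumed definition (the paper does not state its exact criterion), is to find when a smoothed squared-error envelope last exceeds a tolerance band around the final steady-state level:

```python
def convergence_time(error, fs, window_s=0.1, tol_db=1.0):
    """Time (in s) after which the moving-average squared-error
    envelope stays within tol_db of its final steady-state level.
    Illustrative criterion; window and tolerance are placeholders."""
    n = max(1, int(window_s * fs))
    sq = [e * e for e in error]
    # moving-average envelope of the squared error
    env = [sum(sq[i:i + n]) / n for i in range(len(sq) - n + 1)]
    steady = env[-1]
    limit = steady * 10 ** (tol_db / 10.0)
    last_above = -1
    for i, v in enumerate(env):
        if v > limit:
            last_above = i
    return (last_above + 1) / fs
```

Applying the same criterion to both error signals makes the convergence-time comparison between the two algorithms consistent.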

Figure 10

Error measured inside the right earcup for the proposed TBANC-D scheme (blue), compared to the original MRFANC approach (red). The error is shown for two different SNR levels: (a): −10 dB; (b): −5 dB.

3.5 Steady-state performance

The steady-state frequency-domain results of the proposed TBANC-D approach can be observed in Figure 11. The performance of MRFANC and the proposed approach is similar for frequencies ≤200 Hz, but in the frequency range between 200 Hz and 10 kHz the TBANC-D shows a significant improvement, reaching up to 20 dB in the 3–5 kHz region.

Figure 11

Spectra of the passive noise attenuation performance of the headphone shell (blue); TBANC-D approach (red); and the original MRFANC (yellow). The spectra are from the last 20 s period.

In Figure 12, the steady-state error improvement Ie (in dB) achieved by TBANC-D compared to MRFANC for different mixing scenarios is shown.
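The excerpt does not restate the definition of Ie; a natural reading is the ratio, in dB, of the residual-error powers of the two schemes over the final steady-state segment (cf. the 20 s window of Fig. 11). A sketch under that assumption (function name and parameters are ours):

```python
import math

def steady_state_improvement_db(e_ref, e_prop, fs, last_s=20.0):
    """Ie: ratio of residual-error powers over the final last_s seconds,
    in dB. Positive values mean e_prop (the proposed scheme) attains a
    lower steady-state error than e_ref (the reference scheme)."""
    n = int(last_s * fs)
    p_ref = sum(x * x for x in e_ref[-n:]) / n
    p_prop = sum(x * x for x in e_prop[-n:]) / n
    return 10.0 * math.log10(p_ref / p_prop)
```

Under this definition, a residual error ten times smaller in amplitude corresponds to Ie = 20 dB.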

Figure 12

Steady-state error improvement Ie (in dB) achieved by TBANC-D compared to MRFANC for different Primary-Secondary source mixing scenarios. The Ie across the secondary diagonal of all cases is 0 due to the minimum source spacing of 30° between primary and secondary sources as described in Section 2.5.

The results show that the proposed approach achieves a significant improvement in the steady-state error in all cases except when the traffic noise plays the role of the secondary source, i.e., the café-traffic scenario, where performance deteriorates, and the subway-traffic scenario, where the improvement is negligible. This result is, however, compensated by the fact that, according to the results presented in Section 2.3, this case never arises: the traffic noise consistently plays the role of the primary source due to the associated Disturbance metric.

4 Summary and discussion

In this work, TBANC-D, a novel TBANC approach, is proposed, which utilizes an SCDM to steer a TDBF towards the primary disturbing source using a Sound Event Localization and Detection neural network, in order to more strongly attenuate the most disturbing source in a complex auditory scene, while also attenuating the background noise field to acceptable levels. A novel Disturbance metric was developed, based on a listening test procedure, that determines the severity of distinct noises in a complex sound field based on their nature as well as their direction of arrival.

A listening test was conducted to evaluate the disturbance caused by three types of noise at seven different azimuth angles. The results showed, via statistical analysis, that both the type of noise and the azimuth angle had an impact on the perceived disturbance. Traffic was deemed the most disturbing noise, followed by café and subway station noise. The azimuth angle also influenced the disturbance: sources at the side of the listener were deemed most disturbing, followed by sources to the front and then to the back. These findings do not perfectly align with previous research [13] indicating that both the type and spectral characteristics of noise sources have an effect on the evoked disturbance.

The performance of the system was evaluated through the simulation of a diffuse sound field in the presence of up to two noise sources. Specifically, precision, recall and F1-score were used to gauge the performance of the SED system, and the MAE that of the DOAE system; both achieved high performance in their respective tasks, with an average F1-score of 0.85 and an average MAE of 0.33°. The TBANC component was evaluated with respect to steady-state noise attenuation as well as convergence speed, compared to the established MRFANC approach, with the proposed TBANC approach achieving an improvement of up to 20 dB in the 3–5 kHz region; this is especially important since human listeners have significantly increased sensitivity in that region.

It is important to note that a scenario where the recordings utilized in this work coincide to form a single acoustic scene would be highly unlikely; however, such noises are excellent representations of the types of noise encountered in real-life scenarios, such as babble noise, wind, cutlery, etc. Future research will extend the proposed method to accommodate any number of sources with varying locations through the use of the employed SELD network, with similar works achieving exceptional results [24] in such scenarios. In such cases, however, the number of available microphones would have to increase to achieve more directive beamforming.

Furthermore, a personalized Disturbance metric would ideally be developed to further improve the experience of the headphone user. This personalized metric could take into account individual preferences and hearing characteristics, allowing for a more tailored and optimized ANC experience. By considering factors such as the listener’s hearing thresholds, sound preferences, and specific noise environments, the ANC system could dynamically adapt its parameters to provide an enhanced noise cancellation performance [37]. Efforts will also be made to better understand how disturbances are perceived by listeners using temporal evaluation methods [38, 39] to gain a deeper understanding of the effects of different types of noises with varying spectral and spatial characteristics.

In conclusion, the proposed TBANC-D approach, along with the developed Disturbance metric, has shown promising results in attenuating disturbing sources and background noise in complex auditory scenes. The evaluation metrics demonstrate the effectiveness of the system, and future research directions aim to expand its capabilities to handle more sources and personalize the ANC experience for individual listeners.

Conflicts of interest

The authors declare no conflict of interest.

Data availability statement

The data are available from the corresponding author on request.

References

  1. L. Lu, K.-L. Yin, R.C. de Lamare, Z. Zheng, Y. Yu, X. Yang, B. Chen: A survey on active noise control in the past decade – Part I: linear systems, Signal Processing 183 (2021) 108039.
  2. J. Lorente, M. Ferrer, M. de Diego, A. Gonzalez: The frequency partitioned block modified filtered-x NLMS with orthogonal correction factors for multichannel active noise control, Digital Signal Processing 43 (2015) 47–58.
  3. M.T. Akhtar, W. Mitsuhashi: Improving performance of FxLMS algorithm for active noise control of impulsive noise, Journal of Sound and Vibration 327, 3–5 (2009) 647–656.
  4. I.T. Ardekani, W.H. Abdulla: Theoretical convergence analysis of FxLMS algorithm, Signal Processing 90, 12 (2010) 3046–3055.
  5. G.W. Evans, M. Bullinger, S. Hygge: Chronic noise exposure and physiological response: a prospective study of children living under environmental stress, Psychological Science 9, 1 (1998) 75–77.
  6. S. Liebich, J.-G. Richter, J. Fabry, C. Durand, J. Fels, P. Jax: Direction-of-arrival dependency of active noise cancellation headphones, in: ASME 2018 Noise Control and Acoustics Division Session presented at INTERNOISE 2018, Chicago, Illinois, USA, August 26–29, 2018.
  7. V. Patel, J. Cheer, S. Fontana: Design and implementation of an active noise control headphone with directional hear-through capability, IEEE Transactions on Consumer Electronics 66, 1 (2020) 32–40.
  8. T. Xiao, B. Xu, C. Zhao: Spatially selective active noise control systems, Journal of the Acoustical Society of America 153 (2023) 2733.
  9. R. Serizel, M. Moonen, J. Wouters, S.H. Jensen: Integrated active noise control and noise reduction in hearing aids, IEEE Transactions on Audio, Speech, and Language Processing 18, 6 (2009) 1137–1146.
  10. X. Shen, D. Shi, W.-S. Gan, S. Peksi: Adaptive-gain algorithm on the fixed filters applied for active noise control headphone, Mechanical Systems and Signal Processing 169 (2022) 108641.
  11. F. An, B. Liu: Cascade biquad controller design for feedforward active noise control headphones considering incident noise from multiple directions, Applied Acoustics 185 (2022) 108430.
  12. P. Zachos, J. Mourjopoulos: Beamforming headphone ANC for targeted noise attenuation, in: 154th Audio Engineering Society Convention, Espoo, Finland, 13–15 May, Audio Engineering Society, 2023.
  13. P. Zachos, G. Moiragias, J. Mourjopoulos: Targeted beamforming active noise control based on disturbance metrics, in: 10th Convention of the European Acoustics Association Forum Acusticum 2023, Politecnico di Torino, Torino, Italy, 11–15 September, 2023, pp. 3627–3634.
  14. J. Cheer, V. Patel, S. Fontana: The application of a multi-reference control strategy to noise cancelling headphones, Journal of the Acoustical Society of America 145, 5 (2019) 3095–3103.
  15. C. Oberzut, L. Olson: Directionality and the head-shadow effect, Hearing Journal 56, 4 (2003) 56–58.
  16. H.L. Van Trees: Optimum array processing: Part IV of detection, estimation, and modulation theory, John Wiley & Sons, Hoboken, New Jersey, USA, 2002.
  17. O.L. Frost: An algorithm for linearly constrained adaptive array processing, Proceedings of the IEEE 60, 8 (1972) 926–935.
  18. H. Cox, R. Zeskind, M. Owen: Robust adaptive beamforming, IEEE Transactions on Acoustics, Speech, and Signal Processing 35, 10 (1987) 1365–1376.
  19. S.M. Kim, C.J. Chun, H.K. Kim: Multi-channel audio recording based on superdirective beamforming for portable multimedia recording devices, IEEE Transactions on Consumer Electronics 60, 3 (2014) 429–435.
  20. C. Armstrong, L. Thresh, D. Murphy, G. Kearney: A perceptual evaluation of individual and non-individual HRTFs: a case study of the SADIE II database, Applied Sciences 8, 11 (2018) 2029.
  21. EBU Recommendation: Loudness normalisation and permitted maximum level of audio signals, European Broadcasting Union, London, UK, 2011.
  22. M. Schoeffler, S. Bartoschek, F.-R. Stöter, M. Roess, S. Westphal, B. Edler, J. Herre: webMUSHRA – A comprehensive framework for web-based listening tests, Journal of Open Research Software 6, 1 (2018) 8.
  23. N. Zacharov: Sensory evaluation of sound, CRC Press, Boca Raton, Florida, USA, 2018.
  24. T.N.T. Nguyen, K.N. Watcharasupat, N.K. Nguyen, D.L. Jones, W.-S. Gan: SALSA: Spatial cue-augmented log-spectrogram features for polyphonic sound event localization and detection, IEEE/ACM Transactions on Audio, Speech, and Language Processing 30 (2022) 1749–1762.
  25. T.N.T. Nguyen, D.L. Jones, K.N. Watcharasupat, H. Phan, W.-S. Gan: SALSA-Lite: a fast and effective feature for polyphonic sound event localization and detection with microphone arrays, in: ICASSP 2022 – 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, Singapore, 23–27 May, IEEE, 2022, pp. 716–720.
  26. S. Araki, H. Sawada, R. Mukai, S. Makino: Underdetermined blind sparse source separation for arbitrarily arranged multiple sensors, Signal Processing 87, 8 (2007) 1833–1847.
  27. J. Thiemann, N. Ito, E. Vincent: The Diverse Environments Multi-channel Acoustic Noise Database (DEMAND): A database of multichannel environmental noise recordings, in: 21st International Congress on Acoustics, Acoustical Society of America, Montreal, Canada, June, 2013. https://doi.org/10.5281/zenodo.1227120.hal-00796707.
  28. D.S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E.D. Cubuk, Q.V. Le: SpecAugment: a simple data augmentation method for automatic speech recognition, arXiv preprint, 2019. Available at https://arxiv.org/abs/1904.08779.
  29. Z. Zhong, L. Zheng, G. Kang, S. Li, Y. Yang: Random erasing data augmentation, in: Proceedings of the AAAI Conference on Artificial Intelligence 34 (2020) 13001–13008.
  30. Q. Kong, Y. Cao, T. Iqbal, Y. Wang, W. Wang, M.D. Plumbley: PANNs: large-scale pretrained audio neural networks for audio pattern recognition, IEEE/ACM Transactions on Audio, Speech, and Language Processing 28 (2020) 2880–2894.
  31. D.P. Kingma, J. Ba: Adam: a method for stochastic optimization, arXiv preprint, 2014. Available at https://arxiv.org/abs/1412.6980.
  32. X. Kong, S. Kuo: Study of causality constraint on feedforward active noise control systems, IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing 46, 2 (1999) 183–186.
  33. P. Nelson, J. Hammond, P. Joseph, S. Elliott: Active control of stationary random sound fields, Journal of the Acoustical Society of America 87, 3 (1990) 963–975.
  34. R.A. Burdisso, J.S. Vipperman, C.R. Fuller: Causality analysis of feedforward controlled systems with broadband inputs, Journal of the Acoustical Society of America 94, 1 (1993) 234–242.
  35. S. Liebich, J. Fabry, P. Jax, P. Vary: Acoustic path database for ANC in-ear headphone development, Universitätsbibliothek der RWTH Aachen, Aachen, Germany, 2019.
  36. M. Friedman: The use of ranks to avoid the assumption of normality implicit in the analysis of variance, Journal of the American Statistical Association 32, 200 (1937) 675–701.
  37. P. Zachos, G. Kamaris, J. Mourjopoulos: Feedforward headphone active noise control utilizing auditory masking, Journal of the Audio Engineering Society 72 (2024) 235–246.
  38. G. Moiragias, J. Mourjopoulos: An evaluation method for temporal spatial sound attributes, in: 154th Audio Engineering Society Convention, Espoo, Finland, 13–15 May, Audio Engineering Society, 2023.
  39. G. Moiragias, J. Mourjopoulos: Analysis and model of temporal sound attributes from recorded audio, Journal of the Audio Engineering Society 72 (2024) 416–432.

Appendix

The post hoc analysis on the different noise types showed that the Disturbance metric of the café noise was statistically significantly higher than that of the subway station noise (t(125) = 2.65; p = .027), while no statistically significant difference was found with respect to the traffic noise (t(125) = −2.00; p = .143). The Disturbance metric of the traffic noise was statistically significantly higher than that of the subway station noise (t(125) = 4.07; p < .001). Therefore, there is strong evidence that the traffic and café noises have a higher Disturbance metric than the subway station noise, while no statement can be made about the Disturbance metric of the traffic noise compared to that of the café noise.
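The reported statistics are consistent with paired t-tests followed by a multiple-comparison adjustment of the p-values (the exact test and correction used are described in the paper's methods, not in this excerpt). An illustrative sketch of the basic computations, assuming a Bonferroni-style adjustment:

```python
import math
from statistics import mean, stdev

def paired_t(x, y):
    """Paired-sample t statistic and degrees of freedom for two
    equal-length score lists (one pair per participant/condition)."""
    d = [a - b for a, b in zip(x, y)]
    t = mean(d) / (stdev(d) / math.sqrt(len(d)))
    return t, len(d) - 1

def bonferroni(p, m):
    """Bonferroni-adjusted p-value for m pairwise comparisons
    (an assumed correction; the paper may use a different one)."""
    return min(1.0, p * m)
```

The raw p-value for each t statistic would be obtained from the t distribution with the stated degrees of freedom (e.g., via `scipy.stats`) before adjustment.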


1. Demos of the proposed ANC system can be found at http://audiogroup.ece.upatras.gr/tools/TBANCD.php.

Cite this article as: Zachos P., Moiragias G. & Mourjopoulos J. 2024. Targeted beamforming active noise control based on disturbance metrics. Acta Acustica, 8, 39.

All Tables

Table 1

Selected recordings representative of the environments in which the proposed system would be used.


All Figures

Figure 1

Block diagram of the proposed TBANC-D system. p̂ denotes the DOA of the primary disturbing source as determined by the SCDM.

Figure 2

Schematic of the ANC setup formulation. The primary noise source p(n) is denoted by a red speaker with its respective DOA given by p̂, while the secondary source s(n) is denoted by a blue speaker. Broadband white noise sources, denoted by the gray arrows, are placed with a 5° spacing in order to simulate a diffuse noise field.

Figure 3

Beam patterns for (blue): 0.5 kHz; (red): 1 kHz; (yellow): 3 kHz; (a): −30° steering angle; (b): −75° steering angle.

Figure 4

Schematic representation of the azimuth angles of the binaurally synthesized noise excerpts.

Figure 5

Block diagram of the SELD Network. Image adapted from [24].

Figure 6

Right ear controller overview of the proposed targeted ANC with beamforming. p̂ denotes the DOA of the primary disturbing source. An equivalent control algorithm operates independently for the left ear.

Figure 7

Primary (top) and secondary (bottom) path group delay.

Figure 8

Mean values and 95% confidence intervals of the Disturbance metric for each type and azimuth angle of spatialized noises. The mean values of each type of noise are interpolated by the respective dotted lines.
