Acta Acustica, Volume 7 (2023), Article Number 10, 15 pages
Section: Audio Signal Processing and Transducers
DOI: https://doi.org/10.1051/aacus/2023004
Published online: 26 April 2023
Scientific Article
Joint short-time speaker recognition and tracking using sparsity-based source detection
^{1} School of Electronic & Information Engineering, Xi’an Jiaotong University, 710049 Xi’an, China
^{2} Engineering University of People’s Armed Police, 710086 Xi’an, China
^{*} Corresponding author: hyzhu@mail.xjtu.edu.cn
Received: 8 April 2022
Accepted: 24 February 2023
A random finite set-based sequential Monte Carlo tracking method is proposed to track multiple acoustic sources in indoor scenarios. The proposed method improves tracking performance by introducing speaker identities recognized from the received signals. At the front-end, the degenerate unmixing estimation technique (DUET) is employed to separate the mixed signals, and the time delay of arrival (TDOA) is measured. In addition, a criterion to select reliable microphone pairs is designed to quickly obtain accurate speaker identities from the mixed signals, and the Gaussian mixture model-universal background model (GMM-UBM) is employed to train the speaker models. In the tracking step, the update of the weight of each particle is derived after introducing the recognized speaker identities, which results in better association between the measurements and sources. Simulation results demonstrate that the proposed method improves the accuracy of the filter states and discriminates sources close to each other.
Key words: Acoustic source tracking / Blind source separation / Speaker recognition / Random finite set / Particle filtering
© The Author(s), published by EDP Sciences, 2023
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1 Introduction
Using acoustic source localization and tracking techniques, we can localize the position of a speaker, which benefits back-end acoustic signal processing [1–7], e.g., high-quality speech acquisition, camera steering, speaker recognition (SR), word interpretation, automatic driving and leak detection. However, acoustic source localization becomes challenging when the surrounding environment deteriorates [8], and many studies have been performed to improve localization performance.
A popular way to estimate the position of a single source involves calculating the time delays from microphone pairs, and then finding the optimal solution at the intersection of the hyperbolas on which the emitter lies. The precision of the time delay measurement can seriously impact localization performance; thus, researchers have pursued time delay estimation methods with sufficient accuracy. Among such methods, the generalized cross-correlation (GCC) method [9] and its variants estimate the time delay of arrival (TDOA) by comparing the correlation between two received signals in the frequency domain, where the well-known Generalized Cross-Correlation Phase Transform (GCC-PHAT) weighting function is commonly employed due to its superior performance in poor conditions. Although such methods can estimate the time delay well even at moderate signal-to-noise ratio (SNR) and reverberation levels, their performance degenerates when conditions further deteriorate or competing speakers appear. With two received signals, the sparse property of speech is utilized to identify the spurious time-frequency (TF) points, inspiring the iterative contribution removal algorithm [10] for multi-source counting and TDOA estimation. In this algorithm, the TDOA estimation problem is transformed into estimating the slopes of lines. As a TF mask-based blind source separation (BSS) method, the DUET method [11, 12] separates the spectrogram of multiple speech signals into sets of TF bins by using an instantaneous likelihood function for each source. To address the reverberation problem, the above mask-based method was developed by combining GCC, which is referred to as the DUET-GCC method [13]. In the DUET-GCC method, the recovered original signals can be used to estimate the TDOAs from dependable GCC functions of multiple sources. Because TDOAs from clutter TF points can rarely be clustered in DUET, the false alarm rate of the TDOA is generally low.
However, in dynamic scenarios, when the frame length is short, a favorable DUET clustering result is difficult to obtain. In addition, methods for detecting single-source points are presented in [14, 15], which can be employed to enhance the TF-based methods.
After estimating the TDOA measurement, the position of a single source can be determined using iterative or closed-form algorithms, such as Taylor-series estimation [16], Spherical Intersection (SX) [17], Spherical Interpolation (SI) [18] and so on [19–24]. However, under adverse conditions, the accuracy of these methods can be reduced dramatically due to missed detections or false alarms. For a moving source, filtering methods that introduce the dynamic model of the acoustic source can improve the localization accuracy effectively. For a single source, particle filtering (PF) plays an important role in this field and can provide accurate estimation for nonlinear and non-Gaussian state-space models [25]. To handle environments with high-energy interfering sources, a classification map based on the spectral signatures [26] is calculated to remove the influence of the disturbing sources, and a likelihood function is derived to update the weight of each particle. For multiple sources, a joint detection and tracking problem is typically considered, which means both the source positions and the number of sources are time-varying. Ma et al. [27] implemented the random finite set-based sequential Monte Carlo (RFS-SMC) method to track multiple acoustic sources. In addition, Sun and Cheng [28] focused on improving the accuracy of the measurements, and then fed the refined measurements into an RFS-SMC-based filtering framework. To obtain more accurate measurements, robust least squares and microphone selection-based TDOA fusion methods were proposed to resolve the measurement-association problem [29] in our recent work [30], which explicitly accounts for missed detections and produces reliable position measurements in complicated conditions. By adding a repulsive force between a pair of sources, the Langevin dynamical model was modified and a multi-target tracking method based on track-before-detect was introduced [31].
In addition, an implementation of the probability hypothesis density (PHD) filter was incorporated to separate and track an unknown number of sources [32]. In a previous study [33], filter states were used to update the TF mask corresponding to each source, which improved the performance of BSS, and a revised joint probabilistic data association filter was derived to track moving sources. To associate the measurements with the sources, better filtering performance can be achieved using the Rao–Blackwell theorem, which introduces a latent variable to estimate the source state [13, 34]; this approach also improves sampling efficiency. In [8], separate Markov chains for different sources are modeled, and both the DOAs and the associations are estimated. In [35], a binaural scene analyzer was proposed to localize, detect and identify a predefined number of speakers simultaneously. In this method, two classifiers are employed to obtain the spatial log-likelihood map and recognize the speakers, respectively. In [36], artificial neural networks extract equivalent rectangular bandwidth frequency cepstral coefficients and interaural level differences, which are used for speaker identification and azimuth estimation, respectively. In addition, GCC-PHAT patterns have been used as input features for deep neural networks, and two different neural network architectures, i.e. the multilayer perceptron and the convolutional neural network, have been investigated in the speaker localization context [37].
Given the TDOA measurements, the measurement-source association is commonly unknown. Although the latent variable inference method [13, 34] can associate the measurements with the sources, the association at the measurement level has not been exploited. In addition, although joint recognition and tracking has been reported, few studies have utilized speaker identity information directly in the tracking process, leaving the measurement-source association ambiguous. To address this issue, we consider both the recognized speaker identifier and the TDOA measurement to obtain additional gains. For the signals recovered by DUET, although classical methods are competent for speaker recognition, they require a sufficiently long signal. In short-time scenarios, extracting stable speech features from a single specific microphone is challenging.
The primary contributions of this paper are summarized as follows.

We incorporate the speaker identity to track multiple closely spaced acoustic targets using the SMC implementation in the RFS framework.

The separating degree criterion is introduced to select reliable microphone pairs, which helps acquire high-quality BSS results and TDOA estimates.

For each particle, the corresponding measurement likelihood function is reevaluated and the update of the weight is derived when the recognized speaker identity is involved. Extensive simulations are conducted to demonstrate the superiority of the proposed approach.
The remainder of this paper is organized as follows. In Section 2, the acoustic source tracking problem is formulated, and the DUET BSS method is briefly reviewed. The criterion for microphone pair selection for BSS is proposed in Section 3. In addition, the model-based SR foundation is introduced, and, after introducing the recognized speaker identifier, the update of the particle weights in the RFS-SMC tracking method is derived. In Section 4, an experimental assessment of BSS and SR is described, the proposed method is evaluated under a variety of conditions, and the tracking advantage when sources are close to each other is verified. Finally, conclusions and suggestions for future work are given in Section 5.
2 Problem formulation
2.1 Signal propagation and measurement model
Assume that there are M microphones and N speakers. At discrete time t, under reverberant conditions, the received mixture at the mth microphone is given as follows [38, 39]:$${x}_{m}(t)=\sum_{n=1}^{N}\sum_{l=1}^{{L}_{mn}}{h}_{mn}({\tau}_{l})\,{s}_{n}(t-{\tau}_{l})+{\nu}_{m}(t),\quad m=1,\dots,M,$$(1)where s _{ n }(t) is the nth acoustic source signal, ν _{ m }(t) is the ambient noise at the mth microphone, and h _{ mn }(τ _{ l }) is the room impulse response (RIR) between source n and microphone m. In addition, τ _{ l }, l = 1, ⋯, L _{ mn }, are the delays of the direct path and of the numerous reflected paths, and L _{ mn } is the length of the RIR. Typically, we can assume that the ambient noise is a white Gaussian process independent of the source signals.
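As a minimal numerical illustration of equation (1), the following Python sketch convolves each source signal with its RIR and sums the results at one microphone; the function name and the zero-padding convention are our own choices, not part of the paper.

```python
import numpy as np

def mix_at_microphone(sources, rirs, noise_std=0.0, rng=None):
    """Synthesize the mixture x_m(t) of Eq. (1): each source signal s_n is
    convolved with its room impulse response h_mn and summed, plus optional
    white Gaussian noise. `sources` and `rirs` are matching lists of 1-D arrays."""
    rng = rng if rng is not None else np.random.default_rng(0)
    length = max(len(s) + len(h) - 1 for s, h in zip(sources, rirs))
    x = np.zeros(length)
    for s, h in zip(sources, rirs):
        y = np.convolve(h, s)          # reverberant image of one source
        x[:len(y)] += y
    if noise_std > 0:
        x += noise_std * rng.standard_normal(length)
    return x
```

For example, an RIR consisting of a single tap at lag 2 simply delays the source by two samples.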
In TDOA-based localization methods, the TDOA estimate ${z}_{k}^{[q]}$ from the qth microphone pair in frame k can be modeled by$${z}_{k}^{[q]}={\tau}^{[q]}({\mathbf{x}}_{k})+{v}_{k}^{[q]},\quad q=1,\cdots,Q,$$(2)where x _{ k } denotes the coordinate of the source, ${v}_{k}^{[q]}$ is the uncorrelated measurement noise, and Q is the total number of microphone pairs. For τ ^{[q]}(·), the relationship between the true TDOA and the position of the source is$${\tau}^{[q]}({\mathbf{x}}_{k})=\frac{1}{c}\left(\left\|{\mathbf{x}}_{k}-{\mathbf{u}}_{2}^{[q]}\right\|-\left\|{\mathbf{x}}_{k}-{\mathbf{u}}_{1}^{[q]}\right\|\right),$$(3)where $\{{\mathbf{u}}_{1}^{[q]},{\mathbf{u}}_{2}^{[q]}\}$ are the position vectors of the qth microphone pair, and c is the speed of sound [27].
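Equation (3) maps a hypothesized source position to a noise-free TDOA; a direct transcription might look like the sketch below (the function name and the default speed of sound are illustrative):

```python
import numpy as np

def true_tdoa(x, u1, u2, c=343.0):
    """Noise-free TDOA of Eq. (3): the difference between the propagation
    times from source position x to the two microphones u1 and u2,
    divided by the speed of sound c (m/s)."""
    return (np.linalg.norm(x - u2) - np.linalg.norm(x - u1)) / c
```

A source on the perpendicular bisector of the pair yields a TDOA of zero, which is the degenerate geometry discussed in Section 3.1.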
2.2 Sparsity-based DUET method and TDOA estimation
After framing and the short-time Fourier transform (STFT), the received signal in the frequency domain can be approximated by$${X}_{m}(k,\omega)=\sum_{n=1}^{N}{H}_{mn}(\omega){S}_{n}(k,\omega)+{V}_{m}(k,\omega),\quad m=1,\cdots,M,$$(4)where k denotes the time frame index and ω is the angular frequency [39]. In addition, X _{ m }(k, ω), S _{ n }(k, ω) and V _{ m }(k, ω) are the TF domain representations of x _{ m }(·), s _{ n }(·) and ν _{ m }(·) in equation (1), respectively. ${H}_{mn}(\omega)=\sum_{l}{h}_{mn}({\tau}_{l}){e}^{-j\omega{\tau}_{l}}$ is the transfer function between source n and microphone m at frequency ω.
Under the W-disjoint orthogonality (WDO) assumption [12], we have$${S}_{i}(k,\omega){S}_{j}(k,\omega)=0,\quad \forall (k,\omega),\ \forall i\ne j.$$(5)Here, i and j represent different sources.
Without loss of generality, we set one microphone as the reference. Assuming only the direct path is present, i.e. in the anechoic condition, the signals received by one microphone pair can be expressed as follows:$${x}_{1}(t)=\sum_{j=1}^{N}{\tilde{s}}_{j}(t)+{v}_{1}(t),$$(6a)$${x}_{2}(t)=\sum_{j=1}^{N}{a}_{j}{\tilde{s}}_{j}(t-{\delta}_{j})+{v}_{2}(t).$$(6b)
Here, ${\tilde{s}}_{j}(t)$ represents the received signal of source j at microphone one, a _{ j } is the attenuation factor and δ _{ j } is the time delay between this microphone pair, j = 1, ⋯, N. In the noise-free case, after the STFT we have$$\left[\begin{array}{c}{X}_{1}(k,\omega)\\ {X}_{2}(k,\omega)\end{array}\right]=\left[\begin{array}{ccc}1& \dots & 1\\ {a}_{1}{e}^{-i\omega{\delta}_{1}}& \dots & {a}_{N}{e}^{-i\omega{\delta}_{N}}\end{array}\right]\left[\begin{array}{c}{\tilde{S}}_{1}(k,\omega)\\ \vdots \\ {\tilde{S}}_{N}(k,\omega)\end{array}\right].$$(7a)
For disjoint orthogonal sources, at most one source will be nonzero for a given (k, ω); then$$\left[\begin{array}{c}{X}_{1}(k,\omega)\\ {X}_{2}(k,\omega)\end{array}\right]=\left[\begin{array}{c}1\\ {a}_{j}{e}^{-i\omega{\delta}_{j}}\end{array}\right]\left[{S}_{j}(k,\omega)\right],\quad \mathrm{for\ some}\ j.$$(7b)
Therefore, the ratio of the time-frequency representations of the mixtures depends only on the mixing parameters associated with the active source component:$${R}_{21}(k,\omega)=\frac{{X}_{2}(k,\omega)}{{X}_{1}(k,\omega)}={a}_{j}{e}^{-i\omega{\delta}_{j}},$$(8)and the position features, referred to as the level ratio (LR) and time delay (TD), can be obtained as equations (9a) and (9b), respectively,$${a}_{j}(k,\omega)=\left|{R}_{21}(k,\omega)\right|,$$(9a)$${\delta}_{j}(k,\omega)=-\frac{1}{\omega}\angle{R}_{21}(k,\omega),$$(9b)where ∠ represents the phase of a complex number. In equation (9b), when the spacing of the microphone pair is small enough, spatial aliasing can be avoided and the TD is equivalent to the TDOA. Then, a two-dimensional (2D) histogram of joint LR and TD can be accumulated from all TF points; the locations of its prominent peaks reveal the mixing parameters, although spurious peaks may also be present. Furthermore, the symmetric ratio $\tilde{a}(k,\omega)=a(k,\omega)-\frac{1}{a(k,\omega)}$ is commonly used for its symmetry at $\tilde{a}=0$ when the microphone signals are swapped. To separate the blind signals and estimate the TDOAs in a robust manner, a clustering method can be employed to generate the TF mask of each source. Then, the TF points corresponding to the TF mask of each source are transformed back to the time domain to restore the speech signals. In a previous study [40], the authors emphasized that speech signals satisfy a weakened version of WDO, which is referred to as approximate WDO.
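The position features of equations (9a)–(9b) and the symmetric ratio can be computed per TF point from the STFT ratio R21; a minimal sketch, with names of our own choosing:

```python
import numpy as np

def duet_features(X1, X2, omega):
    """Level ratio and time delay of Eqs. (9a)-(9b) from the TF ratio
    R21 = X2/X1, plus the symmetric ratio a - 1/a used for the histogram.
    X1, X2: complex STFT values; omega: matching angular frequencies (> 0)."""
    R21 = X2 / X1
    a = np.abs(R21)                  # level ratio, Eq. (9a)
    delta = -np.angle(R21) / omega   # time delay, Eq. (9b)
    a_sym = a - 1.0 / a              # symmetric ratio
    return a, delta, a_sym
```

Applied to a synthetic single-source TF point with attenuation 2 and delay 0.1 ms, the routine recovers both parameters exactly.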
Under approximate WDO, the TF masks are determined by computing the maximum likelihood (ML) as follows:$$\begin{array}{c}{L}_{j}(k,\omega)\triangleq p\left({X}_{1}(k,\omega),{X}_{2}(k,\omega)\mid {\tilde{a}}_{j},{\delta}_{j}\right)\\ =\frac{1}{2\pi{\sigma}^{2}}\exp\left\{-\frac{1}{2{\sigma}^{2}}\frac{{\left|{\tilde{a}}_{j}{e}^{-i{\delta}_{j}\omega}{X}_{1}(k,\omega)-{X}_{2}(k,\omega)\right|}^{2}}{1+{\tilde{a}}_{j}^{2}}\right\},\end{array}$$(10)where ${\tilde{a}}_{j}$ and δ _{ j } are the position features of source j, determined according to the location of the corresponding peak, and σ ^{2} is the variance of the complex Gaussian noise. Next, a weighted 2D histogram can be constructed for s _{ j } by taking those time-frequency points with L _{ j }(k, ω) ≥ L _{ i }(k, ω), ∀i ≠ j. Then, the TF mask for each source is expressed as follows:$${M}_{j}(k,\omega)\triangleq \left\{\begin{array}{cc}1,& \mathrm{if}\ j=\underset{m}{\mathrm{argmax}}\ {L}_{m}(k,\omega)\\ 0,& \mathrm{otherwise}\end{array}\right.,$$(11)which is the indicator function for the support of s _{ j }. Finally, the source signals can be restored:$${\widehat{S}}_{j}(k,\omega)={M}_{j}(k,\omega)\frac{{X}_{1}(k,\omega)+{\tilde{a}}_{j}{e}^{i{\delta}_{j}\omega}{X}_{2}(k,\omega)}{1+{\tilde{a}}_{j}^{2}}.$$(12)
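The ML masking of equations (10)–(11) and the recovery of equation (12) can be sketched as follows. For clarity this sketch parameterizes each source by its attenuation a_j rather than the symmetric ratio ã_j (the two are interchangeable via a = (ã + √(ã² + 4))/2); all names are illustrative.

```python
import numpy as np

def duet_masks_and_recovery(X1, X2, a, delta, omega, sigma2=1.0):
    """ML TF masks in the spirit of Eqs. (10)-(11) and source recovery as
    in Eq. (12). X1, X2: STFT arrays (n_freq x n_frames); a, delta:
    per-source attenuations and delays taken from the histogram peaks;
    omega: angular frequency of each bin."""
    W = omega[:, None]                                   # broadcast over frames
    logL = []
    for aj, dj in zip(a, delta):
        resid = np.abs(aj * np.exp(-1j * dj * W) * X1 - X2) ** 2
        logL.append(-resid / (2 * sigma2 * (1 + aj ** 2)))
    logL = np.stack(logL)                                # (J, n_freq, n_frames)
    winner = np.argmax(logL, axis=0)
    masks = [winner == j for j in range(len(a))]         # Eq. (11)
    S_hat = [M * (X1 + aj * np.exp(1j * dj * W) * X2) / (1 + aj ** 2)
             for M, aj, dj in zip(masks, a, delta)]      # Eq. (12)
    return masks, S_hat
```

On a toy spectrogram with two sources active in disjoint bins, each mask selects the correct bin and the recovery returns the original unit-amplitude components.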
Compared to the TD [40], the LR position feature in equation (9a) is less credible, especially in the short frame case. Thus, it is ignored in this paper, and a 1D histogram is accumulated. After applying the moving average filter over the histogram [41], shown in equation (13),$${C}_{\mathrm{smooth}}(\varphi)=\frac{1}{2W+1}\sum_{w=-W}^{W}C(\varphi-w),$$(13)the number of sources is determined by counting the prominent peaks. In the above equation, C(φ) denotes the accumulation at bin φ, and 2W + 1 is the length of the smoothing window. Then, the k-means clustering method is employed, with the locations of the peaks as its initial values, to find the TDOAs. A flowchart of the TDOA generation and DUET BSS for a single microphone pair (MP) is shown in Figure 1. After clustering, TDOA measurements are obtained. Meanwhile, distinct masks are generated to recover the original signals. Note that the TDOAs can be obtained directly from DUET [11]; thus, in the following, the DUET-based TDOAs are considered as the measurements. In addition, the separated signals are combined with the TDOA measurements to improve tracking performance.
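The smoothing of equation (13) and the subsequent prominent-peak count admit a compact sketch; the zero-padded edge handling and the plateau-breaking tie rule below are our own choices.

```python
import numpy as np

def smooth_histogram(C, W):
    """Moving-average smoothing of Eq. (13) over a delay histogram C,
    with window length 2W+1 (edges handled by zero padding)."""
    kernel = np.ones(2 * W + 1) / (2 * W + 1)
    return np.convolve(C, kernel, mode='same')

def prominent_peaks(C_smooth, threshold):
    """Indices of local maxima above `threshold`; their count serves as
    the source-number estimate and their locations seed k-means."""
    return [i for i in range(1, len(C_smooth) - 1)
            if C_smooth[i] > C_smooth[i - 1]
            and C_smooth[i] >= C_smooth[i + 1]
            and C_smooth[i] > threshold]
```

Two well-separated histogram spikes survive the smoothing as two prominent peaks, i.e. two estimated sources.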
Figure 1 Flowchart of TDOA Generation and DUET BSS (Only one microphone pair is depicted). 
3 Joint short-time speaker recognition and tracking
In this section, we first describe how the received signals are separated. We then discuss how speaker recognition helps track multiple acoustic sources under the RFS-SMC framework. To obtain the recognized speaker identifiers as accurately as possible, only a subset of MPs is selected for BSS and used to recognize the speakers. The recognized results and the TDOA measurements are then combined to improve tracking performance.
3.1 Microphone pair selection for reliable blind source separation
Usually, a satisfactory DUET histogram can be obtained in an anechoic scenario, and the BSS problem can be solved easily. However, if the sources appear in the vicinity of the same hyperbola of a single microphone pair, e.g. the midline of MP 1 in Figure 2a, the WDO assumption is typically violated, and it is difficult to discriminate the masks of the sources. Although there is another position feature, i.e. the level ratio in equation (9a), the DUET method may fail to separate the mixed signals in practice (Fig. 2b) because the peak of one source in the DUET histogram can be submerged by the others. In addition, as the level of reverberation or noise increases, the TF spectrogram becomes smeared and blurred, and the performance of the DUET BSS method degrades significantly.
Figure 2 (a) Two speakers appear near the same hyperbola of MP 1 and (b) Corresponding accumulated DUET histogram. 
To resolve this problem, we suggest switching to other microphone pairs to find the desired MPs and improve the performance of BSS. Although the DUET method with MP 1 cannot form the true peaks successfully, additional microphone pairs can fill this gap and discriminate the sources. In practice, two or more microphone pairs may be available, so we must determine which microphone pair is the most suitable for recovering the signals. Figure 3 shows the 2D DUET histograms of three additional microphone pairs for the scenario in Figure 2a. Here, histograms with peaks that are distant from each other and of comparable height are preferable because they indicate less mutual interference between the sources. Thus, the sources are easier to separate.
Figure 3 DUET histograms formed by four microphone pairs. 
Thus, the separating degree ρ is defined as the criterion used to identify such a microphone pair:$${\rho}^{[q]}=\frac{1}{{C}_{q,\widehat{N}}^{2}}\sum_{i=1}^{{C}_{q,\widehat{N}}^{2}}{\rho}_{i}^{[q]},$$(14a)$${\rho}_{i}^{[q]}=\alpha\frac{2}{\frac{{h}_{{i}_{1}}}{{h}_{{i}_{2}}}+\frac{{h}_{{i}_{2}}}{{h}_{{i}_{1}}}}+(1-\alpha)\frac{\left\|{\mathbf{m}}_{{i}_{1}}-{\mathbf{m}}_{{i}_{2}}\right\|}{2c{\tau}_{\mathrm{max}}^{[q]}}.$$(14b)
Here $\widehat{N}$ is the estimated number of histogram peaks for microphone pair q, and ${C}_{q,\widehat{N}}^{2}$ is the combination number of any two peaks [i _{1}, i _{2}]. In addition, (m _{ i1}, m _{ i2}) and (h _{ i1}, h _{ i2}) represent the locations and the heights of the two peaks in equation (13), respectively, and α is a controlling factor for the two terms of the summation in equation (14b), where 0 < α < 1. Note that ${\tau}_{\mathrm{max}}^{\left[q\right]}$ is the maximal time delay for microphone pair q. The first term of the summation in equation (14b) derives from the inequality $\frac{{h}_{{i}_{1}}}{{h}_{{i}_{2}}}+\frac{{h}_{{i}_{2}}}{{h}_{{i}_{1}}}\ge 2$, which represents the relative heights of two peaks. Here, a small value indicates that one cluster of TFs is likely to be background noise. The second term represents the distance between the centroids of two clusters. In our case, both terms are normalized. In terms of α, when the histogram has smooth peaks, e.g. under low SNR or high reverberation conditions, a small value is preferable, and the second term will play a more important role in calculating the separating degree.
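Equations (14a)–(14b) can be sketched as follows. We take the peak centroids m in the delay domain, so the factor c in the distance term cancels; this is an assumption of the sketch, since the paper leaves the units of m implicit, and all names are illustrative.

```python
from itertools import combinations

def separating_degree(peak_delay, peak_height, tau_max, alpha=0.5):
    """Separating degree rho of Eqs. (14a)-(14b) for one microphone pair.
    peak_delay: peak locations of the smoothed TD histogram (seconds);
    peak_height: the corresponding histogram heights;
    tau_max: maximal time delay for this pair; alpha in (0, 1)."""
    pairs = list(combinations(range(len(peak_delay)), 2))
    if not pairs:
        return 0.0
    rho = 0.0
    for i1, i2 in pairs:
        h1, h2 = peak_height[i1], peak_height[i2]
        height_term = 2.0 / (h1 / h2 + h2 / h1)   # in (0, 1]; 1 for equal peaks
        dist_term = abs(peak_delay[i1] - peak_delay[i2]) / (2.0 * tau_max)
        rho += alpha * height_term + (1.0 - alpha) * dist_term
    return rho / len(pairs)
```

Two equal-height peaks separated by half the delay range score 0.5·1 + 0.5·0.5 = 0.75 with α = 0.5.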
After defining the separating degree criterion, the desired microphone pairs for BSS are determined as follows:

Select the microphone pairs with the maximal number of detected DUET peaks.

Calculate the separating degree of the selected MPs using equations (14a) and (14b).

Use MPs with the top ${\widehat{N}}_{\mathrm{MP}}$ largest separating degree values for BSS.
Step (1) retains the MPs with the maximum number of detected DUET peaks among all MPs, because they provide more TDOAs, in particular more true measurements, and more accurate speaker identifiers. In step (3), to avoid the negative impact of false speaker recognition, only a few MPs are used to recognize speakers after BSS.
3.2 Foundation of speaker model and short-time speaker recognition
As an important part of speech signal processing, speaker recognition can identify multiple speakers according to their inherent characteristics, which can further help discriminate the positions of the sources. Thus, to further improve tracking performance, after the TDOA measurements are acquired and the source signals are recovered, speaker recognition is employed to help track the speakers. Here, the mel-frequency cepstrum coefficients (MFCC) and their deltas are extracted as speech features. Then, the well-known GMM-UBM [42] is employed to train the speaker model; it was originally proposed to address data sparsity during training. In this model, extensive speech fragments of non-specific speakers are used to train the UBM, and only a few speech fragments of a specific speaker are used to adapt the UBM into the speaker model.
When the speaker models are trained, the average log-likelihood ratio in equation (15) is computed to score and assess which speaker a voice fragment comes from:$${S}_{\mathrm{tar}}=\frac{1}{L}\sum_{k=1}^{L}\left\{\log P({\mathbf{o}}_{k}\mid{\mathbf{\lambda}}_{\mathrm{tar}})-\log P({\mathbf{o}}_{k}\mid{\mathbf{\lambda}}_{\mathrm{ubm}})\right\},$$(15)where S _{tar} is the score for a target speaker model, L is the number of tested frames split from the voice fragment, o _{ k } is the feature extracted from frame k, and λ _{tar} and λ _{ubm} are the target speaker model and the universal background model, respectively.
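Equation (15) only requires per-frame GMM log-likelihoods under the target and UBM models; a self-contained sketch with diagonal-covariance GMMs follows. The tuple-based model representation is our own simplification and not the GMM-UBM training procedure of [42].

```python
import numpy as np

def gmm_logpdf(O, weights, means, variances):
    """Per-frame log-likelihood under a diagonal-covariance GMM.
    O: (L, D) features; weights: (K,); means, variances: (K, D)."""
    diff = O[:, None, :] - means[None]                        # (L, K, D)
    log_comp = -0.5 * (np.sum(diff ** 2 / variances[None], axis=2)
                       + np.sum(np.log(2 * np.pi * variances), axis=1))
    # log-sum-exp over components, weighted by the mixture weights
    return np.logaddexp.reduce(np.log(weights) + log_comp, axis=1)

def speaker_score(O, target_gmm, ubm):
    """Average log-likelihood ratio of Eq. (15): mean over frames of
    log p(o_k | lambda_tar) - log p(o_k | lambda_ubm)."""
    return np.mean(gmm_logpdf(O, *target_gmm) - gmm_logpdf(O, *ubm))
```

With single-component unit-variance models centered at 0 (target) and 5 (UBM), frames at the origin score (25/2) = 12.5 nats in favor of the target.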
3.3 Proposed random finite set-based sequential Monte Carlo tracking method
In this section, we first review the RFS-SMC tracking method [27] and then derive the likelihood used to update the particle weights after introducing the recognized speaker identifiers.
3.3.1 Random finite set representation of the state and the measurement
In the RFS-SMC tracking method, the number of sources in each particle is not specified. Here, each particle is represented by a finite set:$${\mathbf{X}}_{k}=\{{\mathbf{x}}_{1,k},\cdots,{\mathbf{x}}_{{N}_{k},k}\},$$(16)where ${N}_{k}=|{\mathbf{X}}_{k}|$ is the cardinality of the set of source states, which represents the number of active sources in frame k. In addition, x _{ i,k }, i = 1, ⋯, N _{ k }, contains the states of the ith source, e.g. position and velocity. The measurements can also be expressed as follows:$${\mathbf{Z}}_{k}^{[q]}=\left\{{z}_{1,k}^{[q]},\cdots,{z}_{|{\mathbf{Z}}_{k}^{[q]}|,k}^{[q]}\right\},$$(17)where $|{\mathbf{Z}}_{k}^{[q]}|$ represents the number of measurements for microphone pair q in frame k, ${z}_{j,k}^{[q]}$, $j=1,\cdots,|{\mathbf{Z}}_{k}^{[q]}|$, is the acquired TDOA measurement, and q = 1, ⋯, Q, where Q is the total number of microphone pairs.
In the moving scenario, the state model of the particle in frame k can be expressed as follows:$${\mathbf{X}}_{k}=\left\{\bigcup_{i=1,\cdots,{N}_{k-1}}{S}_{k}({\mathbf{x}}_{i,k-1},{\mathbf{w}}_{i,k})\right\}\bigcup {B}_{k}({b}_{k}),$$(18)$${S}_{k}({\mathbf{x}}_{i,k-1},{\mathbf{w}}_{i,k})=\left\{\begin{array}{cc}\varnothing,& {H}_{\mathrm{death}}\\ \{\mathbf{A}{\mathbf{x}}_{i,k-1}+\mathbf{B}{\mathbf{w}}_{i,k}\},& {\overline{H}}_{\mathrm{death}},\end{array}\right.$$(19)where ${S}_{k}({\mathbf{x}}_{i,k-1},{\mathbf{w}}_{i,k})$ represents the state transition of survival source i from the (k−1)th frame to the kth frame, and H _{death} and ${\overline{H}}_{\mathrm{death}}$ are the death and survival hypotheses of source i, respectively. When source i disappears, its state becomes null; otherwise, its state evolves. In equation (18), B _{ k }(b _{ k }) denotes the state of a new source at the kth frame, which can be modeled as follows:$${B}_{k}({b}_{k})=\left\{\begin{array}{cc}\varnothing,& {\overline{H}}_{\mathrm{birth}}\\ \{{b}_{k}\},& {H}_{\mathrm{birth}}\end{array}\right.,$$(20)where H _{birth} and ${\overline{H}}_{\mathrm{birth}}$ are the birth and no-birth hypotheses, respectively, and b _{ k } is a prior state vector for a new source. Typically, it is assumed that at most one source can be born at a time [27]. In fact, even if multiple sound sources appear concurrently, according to the evolution of particles, the other real sources will be born in the following frames. In equation (19), A and B are the matrices that control the evolution of the state, and w _{ i,k } is the random noise.
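One prediction step of equations (18)–(20) for a single particle can be sketched as follows; sampling the death and birth hypotheses with fixed probabilities p_death and p_birth is our simplification of the general model.

```python
import numpy as np

def propagate_particle(X, A, B, p_death, p_birth, birth_state, rng):
    """One step of Eqs. (18)-(20) for a single particle: each source
    survives with probability 1 - p_death and evolves as A x + B w
    (Eq. (19)); at most one new source is born with probability p_birth
    from the prior state b_k (Eq. (20))."""
    X_next = []
    for x in X:
        if rng.random() >= p_death:              # survival hypothesis
            w = rng.standard_normal(B.shape[1])  # process noise w_{i,k}
            X_next.append(A @ x + B @ w)
    if rng.random() < p_birth:                   # birth hypothesis
        X_next.append(np.array(birth_state, dtype=float))
    return X_next
```

With no death, no process noise, and a guaranteed birth, a one-source particle becomes a two-source particle whose first element is unchanged.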
With respect to equation (17), the measurement set in the kth frame can be represented as follows:$${\mathbf{Z}}_{k}^{[q]}=\left\{\bigcup_{i=1,\cdots,|{\mathbf{X}}_{k}|}{T}_{k}^{[q]}({\mathbf{x}}_{i,k},{v}_{i,k}^{[q]})\right\}\cup {C}_{k}^{[q]},$$(21)where$${T}_{k}^{[q]}({\mathbf{x}}_{i,k},{v}_{i,k}^{[q]})=\left\{\begin{array}{cc}\varnothing,& {H}_{\mathrm{miss}}\\ \{{\tau}^{[q]}({\mathbf{x}}_{i,k})+{v}_{i,k}^{[q]}\},& {\overline{H}}_{\mathrm{miss}}.\end{array}\right.$$(22)
In equation (21), ${C}_{k}^{[q]}$ is the false measurement set, and ${T}_{k}^{[q]}(\cdot)$ is the true measurement set with noise variable ${v}_{i,k}^{[q]}\sim N(0,{\sigma}_{v}^{2})$. In equation (22), ${\tau}^{[q]}({\mathbf{x}}_{i,k})$ is given in equation (3), and H _{miss} and ${\overline{H}}_{\mathrm{miss}}$ are the missed detection and detection hypotheses, respectively. For the TDOA measurement, the elements of the false measurement set ${C}_{k}^{[q]}$ lie in the range [−τ _{max}, τ _{max}], and the number of false measurements $|{C}_{k}^{[q]}|$ follows a Poisson distribution with an average rate of λ _{ c }.
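The measurement model of equations (21)–(22), with the clutter statistics just described, can be simulated as follows (names are illustrative):

```python
import numpy as np

def generate_tdoa_set(true_tdoas, tau_max, p_miss, clutter_rate, sigma_v, rng):
    """TDOA measurement set of Eqs. (21)-(22): each true TDOA is detected
    with probability 1 - p_miss and perturbed by N(0, sigma_v^2) noise;
    the clutter count is Poisson(clutter_rate), uniform on [-tau_max, tau_max]."""
    Z = [tau + sigma_v * rng.standard_normal()
         for tau in true_tdoas if rng.random() >= p_miss]
    n_clutter = rng.poisson(clutter_rate)
    Z += list(rng.uniform(-tau_max, tau_max, size=n_clutter))
    return Z
```

With no misses, no clutter, and zero noise variance, the measurement set reduces to the true TDOAs.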
3.3.2 Adding speaker identifier to the source state
The speaker’s inherent character is not contained in the raw representation of the source state [27]. The relationship between the measurement and the source state in the particle is primarily determined by the likelihood function. In addition, when a speaker appears, the state vector in equation (16) does not tell us whether this source has spoken previously.
To improve the performance of the RFSSMC, we conduct the source state augmentation by appending the speaker’s identifier to the source state. Then, the state can be extended as follows:$${\mathbf{\xi}}_{k}={\left[{\mathbf{x}}_{k}^{T},{\gamma}_{k},{\eta}_{k}\right]}^{T},$$(23)where x _{ k } is the source position, γ _{ k } is the birth time of the source, and η _{ k } is the speaker identifier, which can be used to associate a new session of one specific speaker to a previous time, i.e., when a new source is born, we can determine whether this speaker appeared previously.
3.3.3 Birth process
In the PF algorithm, a new source is allowed to appear by sampling with probability P _{birth}, upon which a potential new target is born. In our implementation, the birth process is further controlled by the recognized speaker identifiers. The results recognized from the separated signals of different microphone pairs may differ; thus, to obtain more consistent speaker identifiers, we employ the SR result of the most reliable MP to guide the birth process of a new source. As discussed in Section 3.1, the SR result of the MP with the largest separating degree is used to add the identifier to the particles.
In addition, since the speaker identifier is introduced into the particle, when new frames arrive, only speakers not already contained in the particle need to be added. Therefore, when no new speaker appears in the measurements, the birth of a new source is denied, which reduces false-alarm births.
3.3.4 Survival of sources
In the particle evolution process, if a source in the particle survives in the current frame, its state is updated. After introducing the speaker identifier, three cases must be considered.

If $\widehat{N}={N}_{i}$, i.e. the number of sources detected by DUET BSS equals the number of sources contained in particle i, the states of all sources are updated according to the identifiers.

If $\widehat{N}<{N}_{i}$, some sources have likely disappeared. For this case, we define P _{BSS}, the probability of trusting the number of sources estimated by DUET BSS. When this event occurs, sources that are not associated with any measurement are removed from the particle, which handles death behavior quickly, while the sources associated with the acquired TDOA measurements evolve.

If $\widehat{N}>{N}_{i}$, the state of each surviving source is updated according to the speaker identifier.
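The three survival cases above can be sketched as the following decision logic. This is a simplified illustration with hypothetical inputs: `particle_ids` stands for the speaker identifiers carried by particle i, and `detected_ids` for those recognized from the DUET-BSS output; the actual state update is omitted.

```python
import random

def survival_update(particle_ids, detected_ids, p_bss=0.75, rng=None):
    """Sketch of the three survival cases of Section 3.3.4."""
    rng = rng or random.Random(0)
    n_hat, n_i = len(detected_ids), len(particle_ids)
    if n_hat >= n_i:
        # cases N_hat == N_i and N_hat > N_i: every surviving source is
        # updated, matched to its measurement through the identifier
        return list(particle_ids)
    # case N_hat < N_i: trust the BSS source count with probability p_bss
    # and drop sources not associated with any measurement
    if rng.random() < p_bss:
        return [s for s in particle_ids if s in detected_ids]
    return list(particle_ids)
```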
3.3.5 Update weights of particles
When the state transition density is used as the proposal distribution [27], the weights of the particles in frame k can be updated as follows:$${\omega}_{k}^{\left(i\right)}=\prod _{q=1}^{Q}{g}_{q}\left({\mathbf{Z}}_{k}^{\left[q\right]}\mid {\mathbf{X}}_{k}^{\left(i\right)}\right){\omega}_{k-1}^{\left(i\right)}.$$(24)
The likelihood function of the TDOA set of the qth microphone pair for the ith particle is given as follows:$${g}_{q}\left({\mathbf{Z}}_{k}^{\left[q\right]}\mid {\mathbf{X}}_{k}^{\left(i\right)}\right)=\sum _{{\stackrel{\u0303}{\mathbf{Z}}}_{k}^{\left[q\right]}\subseteq {\mathbf{Z}}_{k}^{\left[q\right]}}{g}_{\mathrm{true},q}\left({\stackrel{\u0303}{\mathbf{Z}}}_{k}^{\left[q\right]}\mid {\mathbf{X}}_{k}^{\left(i\right)}\right){c}_{q}\left({\mathbf{Z}}_{k}^{\left[q\right]}\setminus {\stackrel{\u0303}{\mathbf{Z}}}_{k}^{\left[q\right]}\right),$$(25)where ${g}_{\mathrm{true},q}\left({\stackrel{\u0303}{\mathbf{Z}}}_{k}^{\left[q\right]}\mid {\mathbf{X}}_{k}^{\left(i\right)}\right)$ is the likelihood function of the true TDOAs, and ${c}_{q}\left(\cdot\right)$ is the probability density function (PDF) of the false TDOAs. The conditional PDF in equation (25) can be further decomposed into single-source, single-measurement terms as follows:$${g}_{\mathrm{true},q}\left({\mathbf{Z}}_{k}^{\left[q\right]}\mid {\mathbf{X}}_{k}^{\left(i\right)}\right)={P}_{\mathrm{miss}}^{n-m}{\left(1-{P}_{\mathrm{miss}}\right)}^{m}\sum _{1\le {i}_{1}\ne \cdots \ne {i}_{m}\le n}\prod _{j=1}^{m}{g}_{q}\left({z}_{j,k}^{\left[q\right]}\mid {\mathbf{x}}_{{i}_{j},k}\right),$$(26)where n and m are the true numbers of sources and measurements, respectively, and P _{miss} is the probability of missed detection. A more detailed derivation can be found in Appendix B of [27].
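A brute-force evaluation of the decomposition in equation (26) can be sketched as follows; the function name and default parameter values are hypothetical, and the enumeration over permutations implements the sum over distinct source indices i_1 ≠ … ≠ i_m with a Gaussian single-measurement likelihood.

```python
import math
from itertools import permutations

def true_likelihood(measurements, predicted_tdoas, p_miss=0.25, sigma_v=1e-5):
    """Enumerate every assignment of the m measurements to m distinct
    sources out of the n in the particle, weighted by the
    missed-detection prior P_miss^(n-m) * (1 - P_miss)^m."""
    def gauss_pdf(z, mu):
        return math.exp(-0.5 * ((z - mu) / sigma_v) ** 2) / (
            sigma_v * math.sqrt(2.0 * math.pi))

    m, n = len(measurements), len(predicted_tdoas)
    if m > n:
        return 0.0
    prior = (p_miss ** (n - m)) * ((1.0 - p_miss) ** m)
    total = sum(
        math.prod(gauss_pdf(z, predicted_tdoas[i])
                  for z, i in zip(measurements, idx))
        for idx in permutations(range(n), m))  # i_1 != ... != i_m
    return prior * total
```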
By introducing the speaker identifier with the TDOAs, the likelihood for each source in equation (26) is expanded by the law of total probability as follows:$$g\left({z}_{j,k}^{\left[q\right]}\mid {\mathbf{\xi}}_{k}\right)=\sum _{r=1}^{\stackrel{\u0303}{N}}g\left({z}_{j,k}^{\left[q\right]}\mid {\mathbf{\xi}}_{k},{H}_{r}\right)g\left({H}_{r}\mid {\mathbf{\xi}}_{k}\right),$$(27)where $\stackrel{\u0303}{N}$ is the total number of individual speakers in the scenario, ξ _{ k } is the extended state in equation (23), H _{ r } represents the event that this measurement originated from the rth source, r = 1, …, $\stackrel{\u0303}{N}$, and $g\left({H}_{r}\mid {\mathbf{\xi}}_{k}\right)$ is the transition probability of the speaker identifier, which is introduced here and set according to the speaker recognition accuracy. For the probability of the observation given the hidden variables, $g\left({z}_{j,k}^{\left[q\right]}\mid {\mathbf{\xi}}_{k},{H}_{r}\right)$:

If r = η _{ k }, we obtain:

If r ≠ η _{ k },
Here, the measurement can be associated with a specific source in the particle; thus, the likelihood of η _{ k } is employed in equation (28). When the identifier corresponds to another speaker, the speaker recognition result is used in equation (29), and an intuitive way to compute the likelihood of H _{ r } is to use the normalized log-likelihood ratio in equation (15):$$g\left({z}_{j,k}^{\left[q\right]}\mid {H}_{r}\right)=0.5\cdot \mathrm{tanh}\left(\beta \cdot g\left({\mathbf{o}}_{k}\mid {H}_{r}\right)\right)+0.5,$$(30)where tanh is the hyperbolic tangent function, β is a control parameter, and $g\left({\mathbf{o}}_{k}\mid {H}_{r}\right)$ is the score S _{tar} in equation (15). Generally, in equation (27), the measurement associated with the true source contributes the most when updating the state, which allows us to track different speakers without ambiguity.
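The squashing in equation (30) can be written directly as a one-line function; the function name is hypothetical, and the mapping sends a score of 0 to 0.5 and saturates toward 0 and 1 for strongly negative and positive scores.

```python
import math

def identity_likelihood(score, beta=1.0):
    """Equation (30): map the normalized log-likelihood-ratio score S_tar
    onto (0, 1) with a scaled hyperbolic tangent."""
    return 0.5 * math.tanh(beta * score) + 0.5
```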
3.3.6 Flowchart of proposed tracking approach
The detailed flowchart of the proposed approach (for simplicity, only a single MP is depicted) is given in Figure 4.

Short-time Fourier transform (STFT) is performed on the received signals.

Process each MP with the DUET method in the TF domain and obtain the 2D histogram.

Remove the dimension of LR to form a one-dimensional (1D) histogram.

Determine the number of DUET peaks, cluster the 1D histogram samples using the clustering method, and employ the cluster centers as TDOA measurements.

Select the MPs with the greatest number of measurements (possibly multiple pairs) and calculate the separating degree.

Separate the received signals from the selected MPs.

Extract the MFCC features from the separated signals.

Recognize speakers according to the MFCCs.

For the birth process, according to the SR result of the MP with the greatest separating degree, allow a source not contained in the particle to be born. For the survival of sources, compute the likelihood according to equation (27) to update the source states, and update the particle weights according to equation (24).
Figure 4 Flowchart of proposed approach (Only one microphone pair is depicted). 
Note that the separated signals of all MPs obtained by the DUET method are used for SR, and the number of separated signals, $\widehat{N}$, may differ across MPs. To obtain more consistent speaker identifiers in the birth process, only the identifiers recognized from the MP with the largest separating degree are used. In step (8), for each microphone pair, the weight update uses the measurements together with the corresponding recognized speaker identifiers and scores.
4 Experimental evaluation
4.1 Experimental setup
To evaluate the proposed approach, the IMAGE method [43] was used to generate the room reverberation, where the reverberation levels were set by adjusting the reflection coefficients of the walls. All raw voice fragments were taken from speech recordings downloaded from the Scientific American podcasts [45]. In addition, to obtain clean signals from the specific speakers, background audio, e.g. the speech of non-specific speakers and music, was removed from the original wave files. The resulting data corpus has been uploaded to the Figshare repository [46]. The signals received by the microphones were sampled at 8000 Hz and then framed at two levels. The first-level frame was used to accumulate a sufficient DUET histogram and obtain the TDOA measurements, and its length must be chosen carefully: longer frames are favorable because more credible recovered signals can be obtained and more robust speech features can be extracted, but overly long frames cannot reflect the motion of the sources over time. The second-level frame, taken within the first-level frame, was used to obtain the TF bins for the DUET method and to extract the MFCCs. In the following sections, various lengths and overlaps were considered for the first-level frame, whereas the second-level frame length was uniformly set to 1024 points with a shift of 256 points. For example, with a first-level frame of 4096 points, 13 subframes are used to form the DUET histogram and extract the MFCCs in that first-level frame.
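As a sanity check on the framing arithmetic above, the following sketch (assuming the 256-point figure denotes the frame shift between consecutive 1024-point subframes) reproduces the stated count of 13 subframes per 4096-point first-level frame:

```python
def num_subframes(frame_len, sub_len=1024, hop=256):
    """Number of second-level frames that fit into one first-level frame
    of frame_len samples, with subframe length sub_len and shift hop."""
    return (frame_len - sub_len) // hop + 1
```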
The parameters of the DUET method were set as follows. After converting the 2D histogram to a 1D histogram, the length of the smoothing window was set to three bins in equation (13), and the K-means clustering method was used to cluster the TF samples with the locations of the peaks as the initial points. Note that two additional assumptions are made for the indoor scenarios in this paper.

In one time frame, at most one speaker source can be born.

Only a few speakers are active simultaneously, and the number of speakers is not known in advance.
The first assumption makes the RFS-SMC tracking method easy to implement. The second assumption ensures sufficient performance of the speaker recognition algorithm in short-time scenarios; note that it holds in practical applications. In the following experiments, at most two simultaneously active speakers were considered [13, 27], and thus at most two TDOAs from each MP were collected for subsequent tracking. After the energy-based voice activity detection [31], only the TDOAs whose corresponding DUET peak values exceeded 5% of the height of the highest peak were used. In addition, the weight α was set to 0.4 in equation (14b).
When training the speaker models, i.e., registering speakers, speeches of nine male and eleven female speakers (duration of each fragment: 1–2 min) were used to train the UBM and GMM speaker models, with 32 Gaussian components in each GMM. During tracking, at most four different enrolled speakers were considered, i.e., $\stackrel{\u0303}{N}=4$ in equation (27). When extracting the speech features [44], 36-dimensional MFCCs and their deltas were obtained.
Here, the Langevin process [27] was used to describe the dynamic model. In the survival process, the trust probability P _{BSS} (Sect. 3.3.4) was set to 0.75, and the transition probability in equation (27) was set to 0.3. For the normalized log-likelihood ratio in equation (30), the control parameter β was set to 1. The other main parameters of the baseline RFS-SMC method were set as follows: P _{birth} = 0.05, P _{death} = 0.01, P _{miss} = 0.25 and λ _{ c } = 1. The number of particles for the RFS-SMC was set to 500. Unless stated otherwise, each experiment was run 500 times.
4.2 Performances of blind source separation and speaker recognition methods with short frames
Before tracking the speakers, the performance of the BSS and SR methods with short frames was evaluated. Since separating a single-source signal is unnecessary, the BSS evaluation involved two simultaneous sources. The evaluation of BSS and SR under the anechoic condition with an SNR of 30 dB is described as follows.
In this case, the room size was set to 3 m × 3 m × 3 m (length × width × height), and a total of six microphone pairs with inter-microphone spacing of 4 cm were used to record the signals. This close spacing was adopted to avoid spatial aliasing; note that an implementation of DUET with larger MP spacing can be found in the literature [11]. In addition, a fixed height of 1.5 m was set for all microphones and sources for 2D tracking. The specific MP placement is shown in Figure 5: four MPs are placed near the walls to detect sources distant from the room center, and the two MPs at the center of the room were used to facilitate better separation. The two speakers were stationary at positions (0.5, 0.8) and (2.5, 2.2). The duration of the speeches uttered by these speakers was 15 s.
Figure 5 Placement of six microphone pairs and two speakers in scenario 1. For MP5, the microphones at (1.50, 1.48) and (1.50, 1.52) were paired. For MP6, the microphones at (1.48, 1.50) and (1.52, 1.50) were paired. 
To evaluate the BSS results, the recovered-signal-to-noise ratio (RSNR) of the restored speech was used (in dB).$$\mathrm{RSN}{\mathrm{R}}_{i}=10\mathrm{log}\left(\frac{\sum _{t=1}^{T}{s}_{i}^{2}\left(t\right)}{\sum _{t=1}^{T}{\left({s}_{i}\left(t\right)-{\widehat{s}}_{i}\left(t\right)\right)}^{2}}\right).$$(31)
Here, ${s}_{i}$ and ${\widehat{s}}_{i}$ are the original and restored speech signals, respectively. Each signal was normalized in advance, and the two compared signals were time-aligned prior to calculating the RSNR values.
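Equation (31) can be computed directly as below; the function name is hypothetical, and the inputs are assumed to be already normalized and time-aligned as stated above.

```python
import math

def rsnr_db(s, s_hat):
    """Equation (31): recovered-signal-to-noise ratio in dB between the
    original signal s and the restored signal s_hat."""
    signal_energy = sum(x * x for x in s)
    error_energy = sum((x - y) ** 2 for x, y in zip(s, s_hat))
    return 10.0 * math.log10(signal_energy / error_energy)
```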
When the length of the first-level frame was set to 4096 points with an overlap of 1024 points, the performance of BSS and SR is shown in Table 1. Each row gives the results for one frame; columns two and three show the RSNR values of the two separated signals, column four shows the average RSNR, and columns five and six indicate whether the separated signals were recognized correctly. As can be seen, only a small number of speaker recognition errors occurred despite the low RSNR values.
Performances of BSS and SR (√ denotes correct recognition, and × denotes incorrect recognition).
Because longer frames contain more information about the speaker’s inherent characteristics and allow the BSS method to achieve higher performance, longer frames are preferable for recognizing speakers. To balance this against real-time tracking, after testing various frame lengths and overlaps, a first-level frame length of 4096 points with an overlap of 2048 points was used in the subsequent performance evaluation.
4.3 Evaluation of acoustic source tracking
To evaluate the estimation of the number of sources, the probability of correctly estimating the number of speakers is defined as follows:$$P\left(\left|{\widehat{\mathbf{X}}}_{k}\right|=\left|{\mathbf{X}}_{k}\right|\right).$$(32)
When the number of sources is correct, the conditional mean distance is used to assess the distance error [27] as follows:$${E}_{{\widehat{\mathbf{X}}}_{k}}\left\{d\left({\mathbf{X}}_{k},{\widehat{\mathbf{X}}}_{k}\right)\mid \mathrm{correct\ speaker\ number\ estimate}\right\},$$where$$d\left({\mathbf{X}}_{k},{\widehat{\mathbf{X}}}_{k}\right)=\underset{{j}_{i}\in \left\{1,\cdots ,n\right\},\ {j}_{i}\ne {j}_{l}\ \forall i\ne l}{\mathrm{min}}\sqrt{\frac{1}{n}\sum _{i=1}^{n}{\Vert {\mathbf{x}}_{i,k}-{\widehat{\mathbf{x}}}_{{j}_{i},k}\Vert}^{2}}.$$(33)
Here, the minimization attempts to identify a proper assignment between the estimated positions and the ground truth.
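For the small source counts used here, the minimization in equation (33) can be carried out by exhausting all one-to-one assignments; the following sketch (hypothetical function name, 2-D positions, equal set sizes assumed) illustrates this.

```python
import math
from itertools import permutations

def mean_distance(truth, estimate):
    """Equation (33): RMS position error under the best one-to-one
    assignment of estimated positions to true positions."""
    n = len(truth)
    return min(
        math.sqrt(sum(math.dist(truth[i], estimate[perm[i]]) ** 2
                      for i in range(n)) / n)
        for perm in permutations(range(n)))
```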
In addition, two existing methods were compared as baselines to verify the superiority of the proposed method’s tracking performance.

DUET-RFS-SMC method, which is based on the native RFS-SMC [27] and only uses DUET-based TDOA measurements.

DUET-RBPF method [13], which is constructed in an RFS Bayesian filtering framework and implemented by the Rao-Blackwellized PF method [13].
In the RBPF, the source state is decomposed into the source position and a data association variable. The source position can then be marginalized out using an extended Kalman filter (EKF), and only the association variable needs to be estimated by the PF. In our implementation, at the EKF stage, the position state was initialized at a random position in the room with a velocity of 0 m/s in both directions, the initial variance was set as a diagonal matrix with diagonal elements [1, 0, 1, 0], and the measurement noise variance was set to 5 × 10^{−9}. For fairness, all tracking methods employed DUET to generate the TDOA measurements.
4.3.1 Tracking under anechoic condition
In this subsection, we consider a more complicated scenario, as shown in Figure 6. Speaker 1 moved and spoke continuously from frames 1 to 29; after 10 frames of silence, this source moved and spoke continually for another 34 frames. Meanwhile, speaker 2 moved and spoke from frames 21 to 49; thus, a total of 74 frames was considered. Detailed trajectories are shown in Figure 6.
Figure 6 Placement of microphones and trajectories of two sources in scenario 2. 
Figure 7 shows the estimated trajectories for a single trial under the anechoic condition with an SNR of 30 dB. In this figure, the ground-truth trajectories are marked with green solid lines, and the estimated trajectories of different speakers are represented using different colors and line styles. We found that the TDOA measurements from the DUET method work well for tracking two speakers. As shown in Figure 8, compared to the DUET-RFS-SMC and DUET-RBPF methods, the proposed method achieves better number estimation and better performance in terms of the conditional mean distance.
Figure 7 Estimated tracking trajectories by proposed method for a single trial under anechoic condition, where the red circles represent the microphones. 
Figure 8 Tracking performance of two acoustic sources under anechoic condition. 
4.3.2 Tracking under adverse conditions
The probability of the correct number of sources in equation (32) only assesses the detection of sources, and the well-matched position estimates in equation (33) do not account for over- or underestimation of the number of sources. For further comparisons, the optimal subpattern assignment (OSPA) metric [47] was used to evaluate the number estimation and distance error jointly.
For any finite subsets X = {x _{1}, ⋯, x _{ m }} and Y = {y _{1}, ⋯, y _{ n }} with m ≤ n, the OSPA distance is defined as follows:$${d}_{\mathrm{OSPA}}\left(\mathbf{X},\mathbf{Y}\right)={\left(\frac{1}{n}\left(\underset{\pi \in {\mathrm{\Pi}}_{n}}{\mathrm{min}}\sum _{i=1}^{m}{d}^{\left(\kappa \right)}{\left({\mathbf{x}}_{i},{\mathbf{y}}_{\pi \left(i\right)}\right)}^{p}+{\kappa}^{p}\left(n-m\right)\right)\right)}^{1/p},$$(34)where ${d}^{\left(\kappa \right)}\left(\mathbf{x},\mathbf{y}\right)=\mathrm{min}\left(\kappa ,d\left(\mathbf{x},\mathbf{y}\right)\right)$ is the distance between x and y cut off at $\kappa >0$, 1 ≤ p < ∞, and ∏_{ n } is the set of permutations on {1, ⋯, n} for any n ∈ {1, 2, ⋯}. In the following parts, κ and p are set to 3 and 2, respectively.
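Equation (34) can be evaluated by brute force for the one or two sources tracked here; the following sketch (hypothetical function name, 2-D points) swaps the arguments when |X| > |Y| so that m ≤ n always holds.

```python
import math
from itertools import permutations

def ospa(X, Y, kappa=3.0, p=2.0):
    """OSPA distance of equation (34) for small 2-D point sets,
    minimizing over all injective assignments by enumeration."""
    m, n = len(X), len(Y)
    if m > n:
        return ospa(Y, X, kappa, p)
    if n == 0:
        return 0.0
    best = min(
        sum(min(kappa, math.dist(X[i], Y[perm[i]])) ** p for i in range(m))
        for perm in permutations(range(n), m))
    return ((best + kappa ** p * (n - m)) / n) ** (1.0 / p)
```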
To evaluate the proposed method under adverse conditions, various reverberation levels with the same SNR (30 dB) and different noise levels with the same reverberation time (0.20 s) were considered. Figures 9a and 9b show the OSPA distances under different echoic and noise conditions, respectively. As shown in Figure 9a, the performance of all methods degraded gradually as the reverberation level increased. Because the variance of the importance weights can be reduced in the Rao-Blackwellization step [13], the DUET-RBPF method outperformed the raw RFS-SMC filter (DUET-RFS-SMC) at low reverberation levels; however, as conditions deteriorated, its estimated trajectories began to deviate from the ground truth. Comparatively, after introducing the speaker identifier to the source state, the proposed method outperformed the compared methods under nearly all tested echoic conditions. As shown in Figure 9b, the proposed method also works better than the other methods under adverse noise conditions and demonstrates consistent performance at SNR levels from 20 to 40 dB.
Figure 9 Tracking performance under (a) echoic conditions with the SNR fixed to 30 dB and (b) noisy conditions with the RT60 fixed to 0.20 s. 
4.3.3 Evaluation on sources close to each other
In many scenarios, speakers may move close to each other, making the acquired measurements difficult to discriminate and leading to false measurement–source associations. Several typical association cases in the tracking process are shown in Figure 10, where the green solid lines represent the true source trajectories and the tracks in other colors represent the filtering results. The two true trajectories are parallel lines separated by 1 m. Figure 10a shows correct tracking of the two targets with no crosstalk between the two estimated tracks. Figure 10b shows a false measurement–source association while tracking one source (the blue track list), i.e., one source uses the measurements of the other source to update its state, because the two targets are close and the measurements of the different sources are hard to discriminate. Figure 10c shows an interruption of the tracking (the black and blue track lists), where one track list (black) deviates from the true trajectory. This paper introduces the TDOA measurement together with the corresponding speaker’s inherent character into the tracking process to address these problems.
Figure 10 Tracking demos when two sources move in close range. (a) Accurate tracking. (b) False association occurs while tracking. (c) Tracking list deviates from the true track. 
To reasonably evaluate the tracking performance when sources are in close range, a two-speaker case was considered in which the speakers move in parallel, which ensures that the two targets remain close for a sufficient time. The room size and the placement of the microphones are illustrated in Figure 11. Source 1 and source 2 start moving simultaneously from (1.70, 1.00) and (2.30, 1.00), respectively, and both move for 19 frames at a speed of 0.40 m/s.
Figure 11 Scenario 3: two sources move in close range. 
Since no suitable assessment method exists, we define the revised OSPA distance (OSPAr) to measure the tracking performance for the different sources. In a single experiment, the OSPAr is determined by the following steps:

Extract the center of each tracking list on the x-axis.

According to the distance from its center to source 1 and source 2, assign each track list to the nearer true source, forming two list sets.

Compute the OSPA distance between each list set and its ground-truth trajectory.
The first two steps aim to assign the estimated trajectories to one of the true sources according to spatial position. Although this procedure is only suitable for evaluating the current scenario, it is sufficient to illustrate the superiority of the algorithm when the sources are in close range. In the second step, for one time frame, a source may correspond to zero or more states in the estimated trajectory set. In the last step, the OSPA distance measures the similarity between the two sets. Figure 12 shows the computed distances under RT = 0.20 s and SNR = 30 dB, with the two baseline tracking methods also compared. Because the proposed method uses the speaker identity information associated with each measurement, it can distinguish the sources and track them accurately with the filtering algorithm. Comparatively, the DUET-RFS-SMC and the hidden-variable-estimation-based DUET-RBPF methods perform slightly worse in discriminating the targets at close range. The tracking performance also differs between the two sources because the measurements of source 2 are more accurate.
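The assignment in the first two OSPAr steps can be sketched as follows; the function name and inputs are hypothetical (each track list is summarized by the x-coordinate of its center, and ties go to source 1).

```python
def split_tracks_by_source(track_centers_x, src1_x, src2_x):
    """Assign each estimated track list (indexed by position in
    track_centers_x) to the true source whose x-coordinate is nearer."""
    set1, set2 = [], []
    for i, cx in enumerate(track_centers_x):
        (set1 if abs(cx - src1_x) <= abs(cx - src2_x) else set2).append(i)
    return set1, set2
```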
Figure 12 OSPAr distance when RT = 0.20 s, SNR = 30 dB, and the source spacing equals 60 cm. (a) Proposed. (b) DUET-RFS-SMC. (c) DUET-RBPF. 
Further experiments were conducted with the targets at different spacings; the results are shown in Figure 13. Better performance is achieved at a medium spacing (60 cm). When the spacing between the two sources is too small (40 cm), the mutual interference between the sources is high, and the performance in discriminating the sources degrades.
Figure 13 Comparisons under different source spacings. 
5 Conclusions
This paper uses the DUET method to separate mixed signals and takes two acoustic sources as an example tracking task. In the proposed method, the GMM-UBM model is employed to quickly recognize speakers: the speaker feature is extracted from each frame, and the recognized identifier is utilized to improve tracking performance. Experiments evaluated the feasibility and validity of speaker recognition over short-time frames when one or two speakers appeared, and the results demonstrated superiority in terms of measurement–source association. The proposed method can also be extended to more acoustic sources when more measurements are utilized.
This paper primarily considered the tracking step; better speaker recognition algorithms are beyond its scope, and thus only classic BSS and SR methods were considered. In the future, however, we expect that the proposed method can be improved by employing better BSS or SR methods.
Conflict of interest
The authors declare no conflict of interest.
Funding
This work was supported by the Basic Scientific Research project (JCKY2017207B042), National Natural Science Foundation of China (61673313, 61673317), Science and Technology on Underwater Test and Control Laboratory (6142407190508).
References
 J. Choi, J. Kim, N.S. Kim: Robust time delay estimation for acoustic indoor localization in reverberant environments. IEEE Signal Processing Letters 24, 2 (2017) 226–230. [CrossRef] [Google Scholar]
 M. Jia, J. Sun, C. Bao: Realtime multiple sound source localization and counting using a soundfield microphone. Journal of Ambient Intelligence & Humanized Computing 8, 6 (2017) 829–844. [CrossRef] [Google Scholar]
 B. LauferGoldshtein, R. Talmon, S. Gannot: A hybrid approach for speaker tracking based on TDOA and datadriven models. IEEE Transactions on Audio, Speech, & Language Processing 26, 4 (2018) 725–735. [CrossRef] [Google Scholar]
 S. Argentieri, P. Danes, P. Souères: A survey on sound source localization in robotics: From binaural to array processing methods. Computer Speech & Language 34, 1 (2015) 87–112. [CrossRef] [Google Scholar]
 Q. Zhang, Z. Chen, F. Yin: Speaker tracking based on distributed particle filter in distributed microphone networks. IEEE Transactions on Systems Man & Cybernetics Systems 47, 9 (2017) 2433–2443. [Google Scholar]
 G. Jiang, Y. Liu, Q. Kong, W. Hao, L. An: Study on acoustic time delay localization in boiler tube arrays considering effective sound velocity. Applied Acoustics 171 (2021) 107680. [CrossRef] [Google Scholar]
 E.A. King, A. Tatoglu, D. Iglesias, A. Matriss: Audiovisual based nonlineofsight sound source localization: A feasibility study. Applied Acoustics 171 (2021) 107674. [CrossRef] [Google Scholar]
 K. Weisberg, B. LauferGoldshtein, S. Gannot: Simultaneous tracking and separation of multiple sources using factor graph model. IEEE/ACM Transactions on Audio, Speech, & Language Processing 28 (2020) 2848–2864. [CrossRef] [Google Scholar]
 C.H. Knapp, G.C. Carter: The generalized correlation method for estimation of time delay. IEEE Transactions on Acoustics, Speech, & Signal Processing 24, 4 (1976) 320–327. [CrossRef] [Google Scholar]
 L. Wang, T.K. Hon, J. Reiss, A. Cavallaro: An iterative approach to source counting and localization using two distant microphones. IEEE/ACM Transactions on Audio, Speech, & Language Processing 24, 6 (2016) 1079–1093. [CrossRef] [Google Scholar]
 S. Makino, T.W. Lee, H. Sawada: Blind speech separation. Springer, Netherlands (2007). [CrossRef] [Google Scholar]
 S. Rickard, O. Yilmaz: On the approximate Wdisjoint orthogonality of speech. IEEE International Conference on Acoustics, Speech, and Signal Procesing. Orlando, USA (2002) 529–532. [Google Scholar]
 X. Zhong, J.R. Hopgood: A timefrequency masking based random finite set particle filtering method for multiple acoustic source detection and tracking. IEEE Transactions on Audio, Speech, & Language Processing 23, 12 (2015) 2356–2370. [CrossRef] [Google Scholar]
 A. Griffin, A. Alexandridis, D. Pavlidi: Localizing multiple audio sources in a wireless acoustic sensor network. Signal Processing 107 (2015) 54–67. [CrossRef] [Google Scholar]
 W. Kai, R.V. Gopalan, A.W.H. Khong: Multisource DOA estimation in a reverberant environment using a single acoustic vector sensor. IEEE Transactions on Acoustics, Speech, and Signal Processing 26, 10 (2018) 1848–1859. [Google Scholar]
 W.H. Foy: Positionlocation solutions by Taylorseries estimation. IEEE Transactions on Aerospace & Electronic Systems. AES12 2 (1976) 189–194. [Google Scholar]
 H. Schau, A. Robinson: Passive source location employing spherical surfaces from timeofarrival differences. IEEE Transactions on Acoustics, Speech, & Signal Processing 35, 8 (1987) 1223–1225. [CrossRef] [Google Scholar]
 J.O. Smith, J.S. Abel: Closeform leastsquares source location estimation from rangedifference measurements. IEEE Transactions on Audio, Speech, & Language Processing 35, 12 (1987) 1661–1669. [Google Scholar]
 Y.T. Chan, K.C. Ho: A simple and efficient estimator for hyperbolic location. IEEE Transactions on Signal Processing 42, 8 (2002) 1905–1915. [Google Scholar]
 Y. Huang, J. Benesty, , GW Elko: Passive acoustic source localization for video camera steering. In: IEEE International Conference on Acoustics, Speech, & Signal Processing (2002) 909–912. [Google Scholar]
 Y. Huang, J. Benesty, GW Elko: An efficient linearcorrection leastsquares approach to source localization. In: IEEE Workshop on the Applications of Signal Processing to Audio and Acoustics (2001) 67–70. [Google Scholar]
 Y. Huang, J. Benesty, G.W. Elko: Realtime passive source localization: a practical linearcorrection leastsquares approach. IEEE Transactions on Speech Audio Process 9, 8 (2001) 943–956. [CrossRef] [Google Scholar]
 K. Yang, J. An, X. Bu: Constrained total leastsquares location algorithm using timedifferenceofarrival measurements. IEEE Transactions on Vehicular Technology 59, 3 (2010) 1558–1562. [CrossRef] [Google Scholar]
 Y. Weng, W. Xiao, L. Xie: Total least squares method for robust source localization in sensor networks using TDOA measurements. International Journal of Distributed Sensor Networks 7, 1 (2011) 1063–1067. [Google Scholar]
 D.B. Ward, E.A. Lehmann, R.C. Williamson: Particle filtering algorithms for tracking an acoustic source in a reverberant environment. IEEE Transactions on Speech & Audio Processing 11, 6 (2003) 826–836. [CrossRef] [Google Scholar]
 M. Crocco, S. Martelli, A. Trucco, A. Zunino, V. Murino: Audio tracking in noisy environments by acoustic map and spectral signature. IEEE Transactions on Cybernetics 48, 5 (2018) 1619–1632. [CrossRef] [PubMed] [Google Scholar]
 W.K. Ma, B.N. Vo, S.S. Singh, A. Baddeley:, Tracking an unknown timevarying number of speakers using TDOA measurements: a random finite set approach. IEEE Transactions on Signal Processing 54, 9 (2006) 3291–3304. [CrossRef] [Google Scholar]
 L. Sun, Q. Cheng, Indoor multiple sound source localization using a novel data selection scheme. In: 48th Conference Information Sciences and Systems (2014) 1–6. [Google Scholar]
 A. Alexandridis, A. Mouchtaris: Multiple sound source location estimation in wireless acoustic sensor networks using DOA estimates: The dataassociation problem. IEEE Transactions on Acoustics, Speech, & Signal Processing 26, 2 (2018) 342–356. [Google Scholar]
 Y. Guo, H.Y. Zhu, X.D. Dang: Tracking multiple acoustic sources by adaptive fusion of TDOAs across microphone pairs. Digital Signal Processing 106 (2020), 102853. [CrossRef] [Google Scholar]
 M.F. Fallon, S. Godsill: Acoustic source localization and tracking using track before detect. IEEE Transactions on Audio, Speech, & Language Processing 18, 6 (2010) 1228–1242. [CrossRef] [Google Scholar]
 A. MasnadiShirazi, B.D. Rao: An ICASCTPHD filter approach for tracking and separation of unknown timevarying number of sources. IEEE Transactions on Audio, Speech, & Language Processing 21, 4 (2013) 828–841. [CrossRef] [Google Scholar]
 M. Taseska, E.A.P. Habets: Blind source separation of moving sources using sparsitybased source detection and tracking. IEEE Transactions on Audio, Speech, & Language Processing 18, 6 (2018) 657–670. [CrossRef] [Google Scholar]
 X.H. Zhong: A Bayesian framework for multiple acoustic source tracking. PhD Thesis, University of Edinburgh, 2010. [Google Scholar]
 T. May, S. van de Par, A. Kohlrausch: A binaural scene analyzer for joint localization and recognition of speakers in the presence of interfering noise sources and reverberation. IEEE Transactions on Audio, Speech, & Language Processing 20, 7 (2012) 2016–2030. [CrossRef] [Google Scholar]
 K. Youssef, K. Itoyama, K. Yoshii: Simultaneous identification and localization of still and mobile speakers based on binaural robot audition. Journal of Robotics & Mechatronics 29, 1 (2017) 59–71. [CrossRef] [Google Scholar]
 F. Vesperini, P. Vecchiotti, E. Principi, S. Squartini, F. Piazza: Localizing speakers in multiple rooms by using deep neural networks. Computer Speech & Language 49 (2018) 83–106. [CrossRef] [Google Scholar]
 A.V. Oppenheim, R.W. Schafer: Discrete-Time Signal Processing. Pearson International (2013). [Google Scholar]
 L. Sun, Q. Cheng: Indoor multiple sound source localization using a novel data selection scheme. In: 48th Conference on Information Sciences and Systems (2014) 1–6. [Google Scholar]
 O. Yilmaz, S. Rickard: Blind separation of speech mixtures via time-frequency masking. IEEE Transactions on Signal Processing 52, 7 (2004) 1830–1847. [CrossRef] [Google Scholar]
 K. Wu, V.G. Reju, A.W.H. Khong: Multisource DOA estimation in a reverberant environment using a single acoustic vector sensor. IEEE/ACM Transactions on Audio, Speech, & Language Processing 26, 10 (2018) 1848–1859. [CrossRef] [Google Scholar]
 D.A. Reynolds, T.F. Quatieri, R.B. Dunn: Speaker verification using adapted Gaussian mixture models. Digital Signal Processing 10 (2000) 19–41. [CrossRef] [Google Scholar]
 J.B. Allen, D.A. Berkley: Image method for efficiently simulating smallroom acoustics. Journal of the Acoustical Society of America 65, 4 (1979) 943–950. [Google Scholar]
 S.O. Sadjadi, M. Slaney, L. Heck: MSR identity toolbox v1.0: A MATLAB toolbox for speaker recognition research. Microsoft Research Technical Report, 2013. [Google Scholar]
 [Online]. Available: http://www.scientificamerican.com/podcast. [Google Scholar]
 Y. Guo: Voice data for speaker recognition. figshare (2022). https://doi.org/10.6084/m9.figshare.21388251.v3. [Google Scholar]
 D. Schuhmacher, B.T. Vo, B.N. Vo: A consistent metric for performance evaluation of multi-object filters. IEEE Transactions on Signal Processing 56, 8 (2008) 3447–3457. [CrossRef] [Google Scholar]
Cite this article as: Guo Y. & Zhu H. 2023. Joint short-time speaker recognition and tracking using sparsity-based source detection. Acta Acustica, 7, 10.
All Tables
Performances of BSS and SR (√ denotes correct recognition, × denotes incorrect recognition).
All Figures
Figure 1 Flowchart of TDOA generation and DUET BSS (only one microphone pair is depicted).
Figure 2 (a) Two speakers appear near the same hyperbola of MP 1 and (b) the corresponding accumulated DUET histogram.
Figure 3 DUET histograms formed by four microphone pairs.
Figure 4 Flowchart of the proposed approach (only one microphone pair is depicted).
Figure 5 Placement of six microphone pairs and two speakers in scenario 1. For MP5, the microphones at (1.50, 1.48) and (1.50, 1.52) were paired. For MP6, the microphones at (1.48, 1.50) and (1.52, 1.50) were paired.
Figure 6 Placement of microphones and trajectories of two sources in scenario 2.
Figure 7 Estimated tracking trajectories of the proposed method for a single trial under the anechoic condition, where the red circles represent the microphones.
Figure 8 Tracking performance for two acoustic sources under the anechoic condition.
Figure 9 Tracking performance under (a) the echoic condition with the SNR fixed at 30 dB and (b) the noisy condition with the RT60 fixed at 0.20 s.
Figure 10 Tracking demos when two sources move in close range. (a) Accurate tracking. (b) False association occurs while tracking. (c) The track list deviates from the true track.
Figure 11 Scenario 3: two sources move in close range.
Figure 12 OSPA distance when RT60 = 0.20 s, SNR = 30 dB, and the source spacing equals 60 cm. (a) Proposed. (b) DUET-RFS-SMC. (c) DUET-RBPF.
Figure 13 Comparisons under different source spacings.