Acta Acustica, Volume 7 (2023), Article Number 10, 15 pages
Section: Audio Signal Processing and Transducers
DOI: https://doi.org/10.1051/aacus/2023004
Published online: 26 April 2023
Scientific Article
Joint short-time speaker recognition and tracking using sparsity-based source detection
^{1} School of Electronic & Information Engineering, Xi’an Jiaotong University, 710049 Xi’an, China
^{2} Engineering University of People’s Armed Police, 710086 Xi’an, China
^{*} Corresponding author: hyzhu@mail.xjtu.edu.cn
Received: 8 April 2022
Accepted: 24 February 2023
A random finite set-based sequential Monte Carlo tracking method is proposed to track multiple acoustic sources in indoor scenarios. The proposed method improves tracking performance by introducing speaker identities recognized from the received signals. At the front-end, the degenerate unmixing estimation technique (DUET) is employed to separate the mixed signals, and the time delay of arrival (TDOA) is measured. In addition, a criterion to select reliable microphone pairs is designed to quickly obtain accurate speaker identities from the mixed signals, and the Gaussian mixture model-universal background model (GMM-UBM) is employed to train the speaker models. In the tracking step, the update of the weight of each particle is derived after introducing the recognized speaker identities, which results in better association between the measurements and sources. Simulation results demonstrate that the proposed method improves the accuracy of the filter states and discriminates sources close to each other.
Key words: Acoustic source tracking / Blind source separation / Speaker recognition / Random finite set / Particle filtering
© The Author(s), published by EDP Sciences, 2023
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1 Introduction
Using acoustic source localization and tracking techniques, we can localize the position of a speaker, which benefits back-end acoustic signal processing [1–7], e.g., high-quality speech acquisition, camera steering, speaker recognition (SR), word interpretation, automatic driving and leak detection. However, acoustic source localization becomes challenging when the surrounding environment deteriorates [8], and many studies have been performed to improve localization performance.
A popular way to estimate the position of a single source involves calculating the time delays from microphone pairs, and then finding the optimal solution at the intersection of the hyperbolas on which the emitter lies. The precision of the time delay measurement can seriously impact localization performance; thus, researchers have pursued time delay estimation methods with sufficient accuracy. Among such methods, the generalized cross-correlation (GCC) method [9] and its variants estimate the time delay of arrival (TDOA) by comparing the correlation between two received signals in the frequency domain, where the well-known Generalized Cross-Correlation Phase Transform (GCC-PHAT) weighting function is commonly employed due to its superior performance in poor conditions. Although such methods can estimate the time delay well even at moderate signal-to-noise ratio (SNR) and reverberation levels, their performance degenerates when conditions further deteriorate or competing speakers appear. With two received signals, the sparse property of speech is utilized to identify the spurious time-frequency (TF) points, inspiring the iterative contribution removal algorithm [10] for multi-source counting and TDOA estimation. In this algorithm, the TDOA estimation problem is transformed into estimating the slopes of lines. As a TF mask-based blind source separation (BSS) method, the DUET method [11, 12] separates the spectrogram of multiple speech signals into sets of TF bins by using an instantaneous likelihood function for each source. To address the reverberation problem, the above mask-based method was developed by combining GCC, which is referred to as the DUET-GCC method [13]. In the DUET-GCC method, the recovered original signals can be used to estimate the TDOAs from dependable GCC functions of multiple sources. Because TDOAs from clutter TF points can rarely be clustered in DUET, the false alarm rate of the TDOA is generally low.
However, in dynamic scenarios, when the frame length is short, a favorable DUET clustering result is difficult to obtain. In addition, methods for detecting single-source points are presented in [14, 15], which can be employed to enhance the TF-based methods.
After estimating the TDOA measurement, the position of a single source can be determined using iterative or closed-form algorithms, such as Taylor-series estimation [16], Spherical Intersection (SX) [17], Spherical Interpolation (SI) [18] and so on [19–24]. However, under adverse conditions, the accuracy of these methods can be reduced dramatically due to missed detections or false alarms. For a moving source, filtering methods that introduce the dynamic model of the acoustic source can improve the localization accuracy effectively. For a single source, particle filtering (PF) plays an important role in this field and can provide accurate estimation for nonlinear and non-Gaussian state-space models [25]. To handle environments with high-energy interfering sources, a classification map based on the spectral signatures [26] is calculated to remove the influence of the disturbing sources, and a likelihood function is derived to update the weight of each particle. For multiple sources, a joint detection and tracking problem is typically considered, which means both the source positions and the number of sources are time-varying. Ma et al. [27] implemented the random finite set-based sequential Monte Carlo (RFS-SMC) method to track multiple acoustic sources. In addition, Sun and Cheng [28] focused on improving the accuracy of the measurements, and then fed the refined measurements into an RFS-SMC-based filtering framework. To obtain more accurate measurements, robust least squares and microphone selection-based TDOA fusion methods were proposed to resolve the measurement-association problem [29] in our recent work [30], which explicitly accounts for missed detections and produces reliable position measurements in complicated conditions. By adding a repulsive force between a pair of sources, the Langevin dynamical model was modified and a multi-target tracking method based on track-before-detect was introduced [31].
In addition, an implementation of the probability hypothesis density (PHD) filter was incorporated to separate and track an unknown number of sources [32]. In a previous study [33], filter states were used to update the TF mask corresponding to each source, which improved the performance of BSS, and a revised joint probabilistic data association filter was derived to track moving sources. To associate the measurements with the sources, better filtering performance can be achieved using the Rao–Blackwell theorem, which introduces a latent variable to estimate the source state [13, 34]; this approach also improves sampling efficiency. In [8], separate Markov chains for different sources are modeled, and both the DOAs and the associations are estimated. In [35], a binaural scene analyzer was proposed to localize, detect and identify a predefined number of speakers simultaneously. In this method, two classifiers are employed to obtain the spatial log-likelihood map and recognize the speakers, respectively. In [36], artificial neural networks extract equivalent rectangular bandwidth frequency cepstral coefficients and interaural level differences, which are used for speaker identification and azimuth estimation, respectively. In addition, GCC-PHAT patterns have been used as input features for deep neural networks, and two different neural network architectures, i.e. the multilayer perceptron and the convolutional neural network, have been investigated in the speaker localization context [37].
Given the TDOA measurements, the measurement-source association is commonly unknown. Although the latent variable inference method [13, 34] can associate the measurements with the sources, the association at the measurement level has not been exploited. In addition, although joint recognition and tracking has been reported, few studies have utilized speaker identity information directly in the tracking process, leaving the measurement-source association ambiguous. To address this issue, we consider both the recognized speaker identifier and the TDOA measurement to obtain additional gains. For the signals recovered by DUET, although classical methods are competent for speaker recognition, they require a sufficiently long signal. In short-time scenarios, extracting stable speech features from a single specific microphone is challenging.
The primary contributions of this paper are summarized as follows.

We incorporate the speaker identity to track multiple closely spaced acoustic targets using the SMC implementation in the RFS framework.

The separating degree criterion is introduced to select reliable microphone pairs, which helps acquire high-quality BSS results and TDOA estimates.

For each particle, the corresponding measurement likelihood function is reevaluated and the update of the weight is derived when the recognized speaker identity is involved. Extensive simulations are conducted to demonstrate the superiority of the proposed approach.
The remainder of this paper is organized as follows. In Section 2, the acoustic source tracking problem is formulated, and the DUET BSS method is briefly reviewed. The criterion for microphone pair selection for BSS is proposed in Section 3. In addition, the model-based SR foundation is introduced, and, after introducing the recognized speaker identifier, the update of the particle weights in the RFS-SMC tracking method is derived. In Section 4, an experimental assessment of BSS and SR is described, the proposed method is evaluated under a variety of conditions, and the tracking advantage when sources are close to each other is verified. Finally, conclusions and suggestions for future work are given in Section 5.
2 Problem formulation
2.1 Signal propagation and measurement model
Assume that there are M microphones and N speakers. At discrete time t, under reverberant conditions, the received mixture at the mth microphone is given as follows [38, 39]:$${x}_{m}(t)=\sum_{n=1}^{N}\sum_{l=1}^{{L}_{mn}}{h}_{mn}({\tau}_{l})\,{s}_{n}(t-{\tau}_{l})+{\nu}_{m}(t),\quad m=1,\dots,M,$$(1)where s _{ n }(t) is the nth acoustic source signal, ν _{ m }(t) is the ambient noise at the mth microphone, and h _{ mn }(τ _{ l }) is the room impulse response (RIR) between source n and microphone m. In addition, τ _{ l }, l = 1, ⋯, L _{ mn }, are the delays of the direct path and of the numerous reflected paths, and L _{ mn } is the length of the RIR. Typically, we can assume that the ambient noise is a white Gaussian process independent of the source signals.
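As a minimal numerical illustration of equation (1), the following Python sketch convolves each source signal with its RIR and sums the results at one microphone; the function name and the zero-padding convention are our own choices, not part of the paper.

```python
import numpy as np

def mix_at_microphone(sources, rirs, noise_std=0.0, rng=None):
    """Synthesize the mixture x_m(t) of Eq. (1): each source signal s_n is
    convolved with its room impulse response h_mn and summed, plus optional
    white Gaussian noise. `sources` and `rirs` are matching lists of 1-D arrays."""
    rng = rng if rng is not None else np.random.default_rng(0)
    length = max(len(s) + len(h) - 1 for s, h in zip(sources, rirs))
    x = np.zeros(length)
    for s, h in zip(sources, rirs):
        y = np.convolve(h, s)          # reverberant image of one source
        x[:len(y)] += y
    if noise_std > 0:
        x += noise_std * rng.standard_normal(length)
    return x
```

For example, an RIR consisting of a single tap at lag 2 simply delays the source by two samples.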
In TDOA-based localization methods, the TDOA estimate ${z}_{k}^{[q]}$ from the qth microphone pair in frame k can be modeled by$${z}_{k}^{[q]}={\tau}^{[q]}({\mathbf{x}}_{k})+{v}_{k}^{[q]},\quad q=1,\cdots,Q,$$(2)where x _{ k } denotes the coordinate of the source, ${v}_{k}^{[q]}$ is the uncorrelated measurement noise, and Q is the total number of microphone pairs. For τ ^{[q]}(·), the relationship between the true TDOA and the position of the source is$${\tau}^{[q]}({\mathbf{x}}_{k})=\frac{1}{c}\left(\left\|{\mathbf{x}}_{k}-{\mathbf{u}}_{2}^{[q]}\right\|-\left\|{\mathbf{x}}_{k}-{\mathbf{u}}_{1}^{[q]}\right\|\right),$$(3)where $\{{\mathbf{u}}_{1}^{[q]},{\mathbf{u}}_{2}^{[q]}\}$ are the position vectors of the qth microphone pair, and c is the speed of sound [27].
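Equation (3) maps a hypothesized source position to a noise-free TDOA; a direct transcription might look like the sketch below (the function name and the default speed of sound are illustrative):

```python
import numpy as np

def true_tdoa(x, u1, u2, c=343.0):
    """Noise-free TDOA of Eq. (3): the difference between the propagation
    times from source position x to the two microphones u1 and u2,
    divided by the speed of sound c (m/s)."""
    return (np.linalg.norm(x - u2) - np.linalg.norm(x - u1)) / c
```

A source on the perpendicular bisector of the pair yields a TDOA of zero, which is the degenerate geometry discussed in Section 3.1.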
2.2 Sparsity-based DUET method and TDOA estimation
After framing and the short-time Fourier transform (STFT), the received signal in the frequency domain can be approximated by$${X}_{m}(k,\omega)=\sum_{n=1}^{N}{H}_{mn}(\omega){S}_{n}(k,\omega)+{V}_{m}(k,\omega),\quad m=1,\cdots,M,$$(4)where k denotes the time frame index and ω is the angular frequency [39]. In addition, X _{ m }(k, ω), S _{ n }(k, ω) and V _{ m }(k, ω) are the TF domain representations of x _{ m }(·), s _{ n }(·) and ν _{ m }(·) in equation (1), respectively. ${H}_{mn}(\omega)=\sum_{l}{h}_{mn}({\tau}_{l}){e}^{-j\omega{\tau}_{l}}$ is the transfer function between source n and microphone m at frequency ω.
Under the W-disjoint orthogonality (WDO) assumption [12], we have$${S}_{i}(k,\omega){S}_{j}(k,\omega)=0,\quad \forall (k,\omega),\ \forall i\ne j.$$(5)Here, i and j represent different sources.
Without loss of generality, we set one microphone as the reference. Assuming only the direct path is present, i.e. in the anechoic condition, the signals received by one microphone pair can be expressed as follows:$${x}_{1}(t)=\sum_{j=1}^{N}{\tilde{s}}_{j}(t)+{v}_{1}(t),$$(6a)$${x}_{2}(t)=\sum_{j=1}^{N}{a}_{j}{\tilde{s}}_{j}(t-{\delta}_{j})+{v}_{2}(t).$$(6b)
Here, ${\tilde{s}}_{j}(t)$ represents the received signal of source j at microphone one, a _{ j } is the attenuation factor and δ _{ j } is the time delay between this microphone pair, j = 1, ⋯, N. In the noise-free case, after the STFT we have$$\left[\begin{array}{c}{X}_{1}(k,\omega)\\ {X}_{2}(k,\omega)\end{array}\right]=\left[\begin{array}{ccc}1& \dots & 1\\ {a}_{1}{e}^{-i\omega{\delta}_{1}}& \dots & {a}_{N}{e}^{-i\omega{\delta}_{N}}\end{array}\right]\left[\begin{array}{c}{\tilde{S}}_{1}(k,\omega)\\ \vdots \\ {\tilde{S}}_{N}(k,\omega)\end{array}\right].$$(7a)
For disjoint orthogonal sources, at most one source will be nonzero for a given (k, ω); then$$\left[\begin{array}{c}{X}_{1}(k,\omega)\\ {X}_{2}(k,\omega)\end{array}\right]=\left[\begin{array}{c}1\\ {a}_{j}{e}^{-i\omega{\delta}_{j}}\end{array}\right]\left[{S}_{j}(k,\omega)\right],\quad \mathrm{for\ some}\ j.$$(7b)
Therefore, the ratio of the time-frequency representations of the mixtures depends only on the mixing parameters associated with the active source component:$${R}_{21}(k,\omega)=\frac{{X}_{2}(k,\omega)}{{X}_{1}(k,\omega)}={a}_{j}{e}^{-i\omega{\delta}_{j}},$$(8)and the position features, referred to as the level ratio (LR) and time delay (TD), can be obtained as equations (9a) and (9b), respectively,$${a}_{j}(k,\omega)=\left|{R}_{21}(k,\omega)\right|,$$(9a)$${\delta}_{j}(k,\omega)=-\frac{1}{\omega}\angle{R}_{21}(k,\omega),$$(9b)where ∠ represents the phase of a complex number. In equation (9b), when the spacing of the microphone pair is small enough, spatial aliasing can be avoided and the TD is equivalent to the TDOA. Then, a two-dimensional (2D) histogram of joint LR and TD can be accumulated from all TF points; the locations of its prominent peaks reveal the mixing parameters, although spurious peaks may also be present. Furthermore, the symmetric ratio $\tilde{a}(k,\omega)=a(k,\omega)-\frac{1}{a(k,\omega)}$ is commonly used for its symmetry at $\tilde{a}=0$ when the microphone signals are swapped. To separate the blind signals and estimate the TDOAs in a robust manner, a clustering method can be employed to generate the TF mask of each source. Then, the TF points corresponding to the TF mask of each source are transformed back to the time domain to restore the speech signals. In a previous study [40], the authors emphasized that speech signals satisfy a weakened version of WDO, which is referred to as approximate WDO.
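The position features of equations (9a)–(9b) and the symmetric ratio can be computed per TF point from the STFT ratio R21; a minimal sketch, with names of our own choosing:

```python
import numpy as np

def duet_features(X1, X2, omega):
    """Level ratio and time delay of Eqs. (9a)-(9b) from the TF ratio
    R21 = X2/X1, plus the symmetric ratio a - 1/a used for the histogram.
    X1, X2: complex STFT values; omega: matching angular frequencies (> 0)."""
    R21 = X2 / X1
    a = np.abs(R21)                  # level ratio, Eq. (9a)
    delta = -np.angle(R21) / omega   # time delay, Eq. (9b)
    a_sym = a - 1.0 / a              # symmetric ratio
    return a, delta, a_sym
```

Applied to a synthetic single-source TF point with attenuation 2 and delay 0.1 ms, the routine recovers both parameters exactly.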
Under approximate WDO, the TF masks are determined by computing the maximum likelihood (ML) as follows:$$\begin{array}{c}{L}_{j}(k,\omega)\triangleq p\left({X}_{1}(k,\omega),{X}_{2}(k,\omega)\mid {\tilde{a}}_{j},{\delta}_{j}\right)\\ =\frac{1}{2\pi{\sigma}^{2}}\exp\left\{-\frac{1}{2{\sigma}^{2}}\frac{{\left|{\tilde{a}}_{j}{e}^{-i{\delta}_{j}\omega}{X}_{1}(k,\omega)-{X}_{2}(k,\omega)\right|}^{2}}{1+{\tilde{a}}_{j}^{2}}\right\},\end{array}$$(10)where ${\tilde{a}}_{j}$ and δ _{ j } are the position features of source j, determined according to the location of the corresponding peak, and σ ^{2} is the variance of the complex Gaussian noise. Next, a weighted 2D histogram can be constructed for s _{ j } by taking those time-frequency points with L _{ j }(k, ω) ≥ L _{ i }(k, ω), ∀i ≠ j. Then, the TF mask for each source is expressed as follows:$${M}_{j}(k,\omega)\triangleq \left\{\begin{array}{cc}1,& \mathrm{if}\ j=\underset{m}{\mathrm{argmax}}\ {L}_{m}(k,\omega)\\ 0,& \mathrm{otherwise}\end{array}\right.,$$(11)which is the indicator function for the support of s _{ j }. Finally, the source signals can be restored:$${\widehat{S}}_{j}(k,\omega)={M}_{j}(k,\omega)\frac{{X}_{1}(k,\omega)+{\tilde{a}}_{j}{e}^{i{\delta}_{j}\omega}{X}_{2}(k,\omega)}{1+{\tilde{a}}_{j}^{2}}.$$(12)
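The ML masking of equations (10)–(11) and the recovery of equation (12) can be sketched as follows. For clarity this sketch parameterizes each source by its attenuation a_j rather than the symmetric ratio ã_j (the two are interchangeable via a = (ã + √(ã² + 4))/2); all names are illustrative.

```python
import numpy as np

def duet_masks_and_recovery(X1, X2, a, delta, omega, sigma2=1.0):
    """ML TF masks in the spirit of Eqs. (10)-(11) and source recovery as
    in Eq. (12). X1, X2: STFT arrays (n_freq x n_frames); a, delta:
    per-source attenuations and delays taken from the histogram peaks;
    omega: angular frequency of each bin."""
    W = omega[:, None]                                   # broadcast over frames
    logL = []
    for aj, dj in zip(a, delta):
        resid = np.abs(aj * np.exp(-1j * dj * W) * X1 - X2) ** 2
        logL.append(-resid / (2 * sigma2 * (1 + aj ** 2)))
    logL = np.stack(logL)                                # (J, n_freq, n_frames)
    winner = np.argmax(logL, axis=0)
    masks = [winner == j for j in range(len(a))]         # Eq. (11)
    S_hat = [M * (X1 + aj * np.exp(1j * dj * W) * X2) / (1 + aj ** 2)
             for M, aj, dj in zip(masks, a, delta)]      # Eq. (12)
    return masks, S_hat
```

On a toy spectrogram with two sources active in disjoint bins, each mask selects the correct bin and the recovery returns the original unit-amplitude components.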
Compared to the TD [40], the LR position feature in equation (9a) is less credible, especially in the short frame case. Thus, it is ignored in this paper, and a 1D histogram is accumulated. After applying the moving average filter over the histogram [41], shown in equation (13),$${C}_{\mathrm{smooth}}(\varphi)=\frac{1}{2W+1}\sum_{w=-W}^{W}C(\varphi-w),$$(13)the number of sources is determined by counting the prominent peaks. In the above equation, C(φ) denotes the accumulation at bin φ, and 2W + 1 is the length of the smoothing window. Then, the k-means clustering method is employed, with the locations of the peaks as its initial values, to find the TDOAs. A flowchart of the TDOA generation and DUET BSS for a single microphone pair (MP) is shown in Figure 1. After clustering, TDOA measurements are obtained. Meanwhile, distinct masks are generated to recover the original signals. Note that the TDOAs can be obtained directly from DUET [11]; thus, in the following, the DUET-based TDOAs are considered as the measurements. In addition, the separated signals are combined with the TDOA measurements to improve tracking performance.
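The smoothing of equation (13) and the subsequent prominent-peak count admit a compact sketch; the zero-padded edge handling and the plateau-breaking tie rule below are our own choices.

```python
import numpy as np

def smooth_histogram(C, W):
    """Moving-average smoothing of Eq. (13) over a delay histogram C,
    with window length 2W+1 (edges handled by zero padding)."""
    kernel = np.ones(2 * W + 1) / (2 * W + 1)
    return np.convolve(C, kernel, mode='same')

def prominent_peaks(C_smooth, threshold):
    """Indices of local maxima above `threshold`; their count serves as
    the source-number estimate and their locations seed k-means."""
    return [i for i in range(1, len(C_smooth) - 1)
            if C_smooth[i] > C_smooth[i - 1]
            and C_smooth[i] >= C_smooth[i + 1]
            and C_smooth[i] > threshold]
```

Two well-separated histogram spikes survive the smoothing as two prominent peaks, i.e. two estimated sources.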
Figure 1 Flowchart of TDOA Generation and DUET BSS (Only one microphone pair is depicted). 
3 Joint short-time speaker recognition and tracking
In this section, we first describe how the received signals are separated. We then discuss how speaker recognition helps track multiple acoustic sources under the RFS-SMC framework. To obtain the recognized speaker identifiers as accurately as possible, only a subset of MPs is selected for BSS and used to recognize the speakers. The recognized results and the TDOA measurements are then combined to improve tracking performance.
3.1 Microphone pair selection for reliable blind source separation
Usually, a satisfactory DUET histogram can be obtained in an anechoic scenario, and the BSS problem can be solved easily. However, if the sources appear in the vicinity of the same hyperbola of a single microphone pair, e.g. the midline of MP 1 in Figure 2a, the WDO assumption is typically violated, and it is difficult to discriminate the masks of the sources. Although there is another position feature, i.e. the level ratio in equation (9a), the DUET method may fail to separate the mixed signals in practice (Fig. 2b) because the peak of one source in the DUET histogram can be submerged by the others. In addition, as the level of reverberation or noise increases, the TF spectrogram becomes smeared and blurred, and the performance of the DUET BSS method degrades significantly.
Figure 2 (a) Two speakers appear near the same hyperbola of MP 1 and (b) Corresponding accumulated DUET histogram. 
To resolve this problem, we suggest switching to other microphone pairs to find the desired MPs and improve the performance of BSS. Although the DUET method with MP 1 cannot form the true peaks successfully, additional microphone pairs can fill this gap and discriminate the sources. In practice, two or more microphone pairs may be available, so we must determine which microphone pair is the most suitable for recovering the signals. Figure 3 shows the 2D DUET histograms of three additional microphone pairs for the scenario in Figure 2a. Here, histograms with peaks that are distant from each other and of comparable height are preferable because they indicate less mutual interference between the sources. Thus, the sources are easier to separate.
Figure 3 DUET histograms formed by four microphone pairs. 
Thus, the separating degree ρ is defined as the criterion used to identify such a microphone pair:$${\rho}^{[q]}=\frac{1}{{C}_{q,\widehat{N}}^{2}}\sum_{i=1}^{{C}_{q,\widehat{N}}^{2}}{\rho}_{i}^{[q]},$$(14a)$${\rho}_{i}^{[q]}=\alpha\frac{2}{\frac{{h}_{{i}_{1}}}{{h}_{{i}_{2}}}+\frac{{h}_{{i}_{2}}}{{h}_{{i}_{1}}}}+(1-\alpha)\frac{\left\|{\mathbf{m}}_{{i}_{1}}-{\mathbf{m}}_{{i}_{2}}\right\|}{2c{\tau}_{\mathrm{max}}^{[q]}}.$$(14b)
Here $\widehat{N}$ is the estimated number of histogram peaks for microphone pair q, and ${C}_{q,\widehat{N}}^{2}$ is the combination number of any two peaks [i _{1}, i _{2}]. In addition, (m _{ i1}, m _{ i2}) and (h _{ i1}, h _{ i2}) represent the locations and the heights of the two peaks in equation (13), respectively, and α is a controlling factor for the two terms of the summation in equation (14b), where 0 < α < 1. Note that ${\tau}_{\mathrm{max}}^{\left[q\right]}$ is the maximal time delay for microphone pair q. The first term of the summation in equation (14b) derives from the inequality $\frac{{h}_{{i}_{1}}}{{h}_{{i}_{2}}}+\frac{{h}_{{i}_{2}}}{{h}_{{i}_{1}}}\ge 2$, which represents the relative heights of two peaks. Here, a small value indicates that one cluster of TFs is likely to be background noise. The second term represents the distance between the centroids of two clusters. In our case, both terms are normalized. In terms of α, when the histogram has smooth peaks, e.g. under low SNR or high reverberation conditions, a small value is preferable, and the second term will play a more important role in calculating the separating degree.
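Equations (14a)–(14b) can be sketched as follows. We take the peak centroids m in the delay domain, so the factor c in the distance term cancels; this is an assumption of the sketch, since the paper leaves the units of m implicit, and all names are illustrative.

```python
from itertools import combinations

def separating_degree(peak_delay, peak_height, tau_max, alpha=0.5):
    """Separating degree rho of Eqs. (14a)-(14b) for one microphone pair.
    peak_delay: peak locations of the smoothed TD histogram (seconds);
    peak_height: the corresponding histogram heights;
    tau_max: maximal time delay for this pair; alpha in (0, 1)."""
    pairs = list(combinations(range(len(peak_delay)), 2))
    if not pairs:
        return 0.0
    rho = 0.0
    for i1, i2 in pairs:
        h1, h2 = peak_height[i1], peak_height[i2]
        height_term = 2.0 / (h1 / h2 + h2 / h1)   # in (0, 1]; 1 for equal peaks
        dist_term = abs(peak_delay[i1] - peak_delay[i2]) / (2.0 * tau_max)
        rho += alpha * height_term + (1.0 - alpha) * dist_term
    return rho / len(pairs)
```

Two equal-height peaks separated by half the delay range score 0.5·1 + 0.5·0.5 = 0.75 with α = 0.5.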
After defining the separating degree criterion, the desired microphone pairs for BSS are determined as follows:

Select the microphone pairs with the maximal number of detected DUET peaks.

Calculate the separating degree of the selected MPs using equations (14a) and (14b).

Use MPs with the top ${\widehat{N}}_{\mathrm{MP}}$ largest separating degree values for BSS.
Step (1) retains the MPs with the maximum number of detected DUET peaks among all MPs, because they provide more TDOAs, in particular more true measurements, and more accurate speaker identifiers. In step (3), to avoid the negative impact of false speaker recognition, only a few MPs are used to recognize speakers after BSS.
3.2 Foundation of speaker model and short-time speaker recognition
As an important part of speech signal processing, speaker recognition can identify multiple speakers according to their inherent characteristics, which can further help discriminate the positions of the sources. Thus, to further improve tracking performance, after the TDOA measurements are acquired and the source signals are recovered, speaker recognition is employed to help track the speakers. Here, the mel-frequency cepstrum coefficients (MFCC) and their deltas are extracted as speech features. Then, the well-known GMM-UBM [42] is employed to train the speaker model; it was originally proposed to address data sparsity during training. In this model, extensive speech fragments of non-specific speakers are used to train the UBM, and only a few speech fragments of a specific speaker are used to adapt the UBM into the speaker model.
When the speaker models are trained, the average log-likelihood ratio in equation (15) is computed to score and assess which speaker a voice fragment comes from:$${S}_{\mathrm{tar}}=\frac{1}{L}\sum_{k=1}^{L}\left\{\log P({\mathbf{o}}_{k}\mid{\mathbf{\lambda}}_{\mathrm{tar}})-\log P({\mathbf{o}}_{k}\mid{\mathbf{\lambda}}_{\mathrm{ubm}})\right\},$$(15)where S _{tar} is the score for a target speaker model, L is the number of tested frames split from the voice fragment, o _{ k } is the feature extracted from frame k, and λ _{tar} and λ _{ubm} are the target speaker model and the universal background model, respectively.
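Equation (15) only requires per-frame GMM log-likelihoods under the target and UBM models; a self-contained sketch with diagonal-covariance GMMs follows. The tuple-based model representation is our own simplification and not the GMM-UBM training procedure of [42].

```python
import numpy as np

def gmm_logpdf(O, weights, means, variances):
    """Per-frame log-likelihood under a diagonal-covariance GMM.
    O: (L, D) features; weights: (K,); means, variances: (K, D)."""
    diff = O[:, None, :] - means[None]                        # (L, K, D)
    log_comp = -0.5 * (np.sum(diff ** 2 / variances[None], axis=2)
                       + np.sum(np.log(2 * np.pi * variances), axis=1))
    # log-sum-exp over components, weighted by the mixture weights
    return np.logaddexp.reduce(np.log(weights) + log_comp, axis=1)

def speaker_score(O, target_gmm, ubm):
    """Average log-likelihood ratio of Eq. (15): mean over frames of
    log p(o_k | lambda_tar) - log p(o_k | lambda_ubm)."""
    return np.mean(gmm_logpdf(O, *target_gmm) - gmm_logpdf(O, *ubm))
```

With single-component unit-variance models centered at 0 (target) and 5 (UBM), frames at the origin score (25/2) = 12.5 nats in favor of the target.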
3.3 Proposed random finite set-based sequential Monte Carlo tracking method
In this section, we first review the RFS-SMC tracking method [27] and then derive the likelihood used to update the particle weights after introducing the recognized speaker identifiers.
3.3.1 Random finite set representation of the state and the measurement
In the RFS-SMC tracking method, the number of sources in each particle is not specified. Here, each particle is represented by a finite set:$${\mathbf{X}}_{k}=\{{\mathbf{x}}_{1,k},\cdots,{\mathbf{x}}_{{N}_{k},k}\},$$(16)where ${N}_{k}=|{\mathbf{X}}_{k}|$ is the cardinality of the set of source states, which represents the number of active sources in frame k. In addition, x _{ i,k }, i = 1, ⋯, N _{ k }, contains the states of the ith source, e.g. position and velocity. The measurements can also be expressed as follows:$${\mathbf{Z}}_{k}^{[q]}=\left\{{z}_{1,k}^{[q]},\cdots,{z}_{|{\mathbf{Z}}_{k}^{[q]}|,k}^{[q]}\right\},$$(17)where $|{\mathbf{Z}}_{k}^{[q]}|$ represents the number of measurements for microphone pair q in frame k, ${z}_{j,k}^{[q]}$, $j=1,\cdots,|{\mathbf{Z}}_{k}^{[q]}|$, is the acquired TDOA measurement, and q = 1, ⋯, Q, where Q is the total number of microphone pairs.
In the moving scenario, the state model of the particle in frame k can be expressed as follows:$${\mathbf{X}}_{k}=\left\{\bigcup_{i=1,\cdots,{N}_{k-1}}{S}_{k}({\mathbf{x}}_{i,k-1},{\mathbf{w}}_{i,k})\right\}\bigcup {B}_{k}({b}_{k}),$$(18)$${S}_{k}({\mathbf{x}}_{i,k-1},{\mathbf{w}}_{i,k})=\left\{\begin{array}{cc}\varnothing,& {H}_{\mathrm{death}}\\ \{\mathbf{A}{\mathbf{x}}_{i,k-1}+\mathbf{B}{\mathbf{w}}_{i,k}\},& {\overline{H}}_{\mathrm{death}},\end{array}\right.$$(19)where ${S}_{k}({\mathbf{x}}_{i,k-1},{\mathbf{w}}_{i,k})$ represents the state transition of survival source i from the (k−1)th frame to the kth frame, and H _{death} and ${\overline{H}}_{\mathrm{death}}$ are the death and survival hypotheses of source i, respectively. When source i disappears, its state becomes null; otherwise, its state evolves. In equation (18), B _{ k }(b _{ k }) denotes the state of a new source at the kth frame, which can be modeled as follows:$${B}_{k}({b}_{k})=\left\{\begin{array}{cc}\varnothing,& {\overline{H}}_{\mathrm{birth}}\\ \{{b}_{k}\},& {H}_{\mathrm{birth}}\end{array}\right.,$$(20)where H _{birth} and ${\overline{H}}_{\mathrm{birth}}$ are the birth and no-birth hypotheses, respectively, and b _{ k } is a prior state vector for a new source. Typically, it is assumed that at most one source can be born at a time [27]. In fact, even if multiple sound sources appear concurrently, according to the evolution of particles, the other real sources will be born in the following frames. In equation (19), A and B are the matrices that control the evolution of the state, and w _{ i,k } is the random noise.
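One prediction step of equations (18)–(20) for a single particle can be sketched as follows; sampling the death and birth hypotheses with fixed probabilities p_death and p_birth is our simplification of the general model.

```python
import numpy as np

def propagate_particle(X, A, B, p_death, p_birth, birth_state, rng):
    """One step of Eqs. (18)-(20) for a single particle: each source
    survives with probability 1 - p_death and evolves as A x + B w
    (Eq. (19)); at most one new source is born with probability p_birth
    from the prior state b_k (Eq. (20))."""
    X_next = []
    for x in X:
        if rng.random() >= p_death:              # survival hypothesis
            w = rng.standard_normal(B.shape[1])  # process noise w_{i,k}
            X_next.append(A @ x + B @ w)
    if rng.random() < p_birth:                   # birth hypothesis
        X_next.append(np.array(birth_state, dtype=float))
    return X_next
```

With no death, no process noise, and a guaranteed birth, a one-source particle becomes a two-source particle whose first element is unchanged.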
With respect to equation (17), the measurement set in the kth frame can be represented as follows:$${\mathbf{Z}}_{k}^{[q]}=\left\{\bigcup_{i=1,\cdots,|{\mathbf{X}}_{k}|}{T}_{k}^{[q]}({\mathbf{x}}_{i,k},{v}_{i,k}^{[q]})\right\}\cup {C}_{k}^{[q]},$$(21)where$${T}_{k}^{[q]}({\mathbf{x}}_{i,k},{v}_{i,k}^{[q]})=\left\{\begin{array}{cc}\varnothing,& {H}_{\mathrm{miss}}\\ \{{\tau}^{[q]}({\mathbf{x}}_{i,k})+{v}_{i,k}^{[q]}\},& {\overline{H}}_{\mathrm{miss}}.\end{array}\right.$$(22)
In equation (21), ${C}_{k}^{[q]}$ is the false measurement set, and ${T}_{k}^{[q]}(\cdot)$ is the true measurement set with noise variable ${v}_{i,k}^{[q]}\sim N(0,{\sigma}_{v}^{2})$. In equation (22), ${\tau}^{[q]}({\mathbf{x}}_{i,k})$ is given in equation (3), and H _{miss} and ${\overline{H}}_{\mathrm{miss}}$ are the missed detection and detection hypotheses, respectively. For the TDOA measurement, the elements of the false measurement set ${C}_{k}^{[q]}$ lie in the range [−τ _{max}, τ _{max}], and the number of false measurements $|{C}_{k}^{[q]}|$ follows a Poisson distribution with an average rate of λ _{ c }.
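The measurement model of equations (21)–(22), with the clutter statistics just described, can be simulated as follows (names are illustrative):

```python
import numpy as np

def generate_tdoa_set(true_tdoas, tau_max, p_miss, clutter_rate, sigma_v, rng):
    """TDOA measurement set of Eqs. (21)-(22): each true TDOA is detected
    with probability 1 - p_miss and perturbed by N(0, sigma_v^2) noise;
    the clutter count is Poisson(clutter_rate), uniform on [-tau_max, tau_max]."""
    Z = [tau + sigma_v * rng.standard_normal()
         for tau in true_tdoas if rng.random() >= p_miss]
    n_clutter = rng.poisson(clutter_rate)
    Z += list(rng.uniform(-tau_max, tau_max, size=n_clutter))
    return Z
```

With no misses, no clutter, and zero noise variance, the measurement set reduces to the true TDOAs.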
3.3.2 Adding speaker identifier to the source state
The speaker’s inherent character is not contained in the raw representation of the source state [27]. The relationship between the measurement and the source state in the particle is primarily determined by the likelihood function. In addition, when a speaker appears, the state vector in equation (16) does not tell us whether this source has spoken previously.
To improve the performance of the RFSSMC, we conduct the source state augmentation by appending the speaker’s identifier to the source state. Then, the state can be extended as follows:$${\mathbf{\xi}}_{k}={\left[{\mathbf{x}}_{k}^{T},{\gamma}_{k},{\eta}_{k}\right]}^{T},$$(23)where x _{ k } is the source position, γ _{ k } is the birth time of the source, and η _{ k } is the speaker identifier, which can be used to associate a new session of one specific speaker to a previous time, i.e., when a new source is born, we can determine whether this speaker appeared previously.
3.3.3 Birth process
In the PF algorithm, a new source is allowed to appear by sampling with probability P _{birth}, upon which a potential new target is born. In our implementation, the birth process is further controlled by the recognized speaker identifiers. The results recognized from the separated signals of different microphone pairs may differ; thus, to obtain more consistent speaker identifiers, we employ the SR result of the most reliable MP to guide the birth process of a new source. As discussed in Section 3.1, the SR result of the MP with the largest separating degree is used to add the identifier to the particles.
In addition, since the speaker identifier is introduced into the particle, when new frames arrive, only speakers not already contained in the particle need to be added. Therefore, when no new speaker appears in the measurements, the birth of a new source is denied, which reduces false-alarm births.
3.3.4 Survival of sources
In the particle evolution process, if a source in the particle survives in the current frame, its state is updated. After introducing the speaker identifier, three cases must be considered.

If $\widehat{N}={N}_{i}$, i.e. the number of sources detected by DUET BSS equals the number of sources contained in particle i, the states of all sources are updated according to the identifiers.

If $\widehat{N}<{N}_{i}$, some sources have likely disappeared. For this case, we define P _{BSS}, the probability of trusting the number of sources estimated by DUET BSS. When this event occurs, sources that are not associated with any measurement are removed from the particle, which handles death behavior quickly, while the sources associated with the acquired TDOA measurements evolve.

If $\widehat{N}>{N}_{i}$, the state of each surviving source is updated according to the speaker identifier.
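The three survival cases above can be sketched as the following decision logic. This is a simplified illustration with hypothetical inputs: `particle_ids` stands for the speaker identifiers carried by particle i, and `detected_ids` for those recognized from the DUET-BSS output; the actual state update is omitted.

```python
import random

def survival_update(particle_ids, detected_ids, p_bss=0.75, rng=None):
    """Sketch of the three survival cases of Section 3.3.4."""
    rng = rng or random.Random(0)
    n_hat, n_i = len(detected_ids), len(particle_ids)
    if n_hat >= n_i:
        # cases N_hat == N_i and N_hat > N_i: every surviving source is
        # updated, matched to its measurement through the identifier
        return list(particle_ids)
    # case N_hat < N_i: trust the BSS source count with probability p_bss
    # and drop sources not associated with any measurement
    if rng.random() < p_bss:
        return [s for s in particle_ids if s in detected_ids]
    return list(particle_ids)
```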
3.3.5 Update weights of particles
When the state transition density is used as the proposal distribution [27], the weights of the particles in frame k can be updated as follows:$${\omega}_{k}^{\left(i\right)}=\prod _{q=1}^{Q}{g}_{q}\left({\mathbf{Z}}_{k}^{\left[q\right]}\mid {\mathbf{X}}_{k}^{\left(i\right)}\right){\omega}_{k-1}^{\left(i\right)}.$$(24)
The likelihood function of the TDOA set of the qth microphone pair for the ith particle is given as follows:$${g}_{q}\left({\mathbf{Z}}_{k}^{\left[q\right]}\mid {\mathbf{X}}_{k}^{\left(i\right)}\right)=\sum _{{\stackrel{\u0303}{\mathbf{Z}}}_{k}^{\left[q\right]}\subseteq {\mathbf{Z}}_{k}^{\left[q\right]}}{g}_{\mathrm{true},q}\left({\stackrel{\u0303}{\mathbf{Z}}}_{k}^{\left[q\right]}\mid {\mathbf{X}}_{k}^{\left(i\right)}\right){c}_{q}\left({\mathbf{Z}}_{k}^{\left[q\right]}\setminus {\stackrel{\u0303}{\mathbf{Z}}}_{k}^{\left[q\right]}\right),$$(25)where ${g}_{\mathrm{true},q}\left({\stackrel{\u0303}{\mathbf{Z}}}_{k}^{\left[q\right]}\mid {\mathbf{X}}_{k}^{\left(i\right)}\right)$ is the likelihood function of the true TDOAs, and ${c}_{q}\left(\cdot\right)$ is the probability density function (PDF) of the false TDOAs. The conditional PDF in equation (25) can be further decomposed into single-source, single-measurement terms as follows:$${g}_{\mathrm{true},q}\left({\mathbf{Z}}_{k}^{\left[q\right]}\mid {\mathbf{X}}_{k}^{\left(i\right)}\right)={P}_{\mathrm{miss}}^{n-m}{\left(1-{P}_{\mathrm{miss}}\right)}^{m}\sum _{1\le {i}_{1}\ne \cdots \ne {i}_{m}\le n}\prod _{j=1}^{m}{g}_{q}\left({z}_{j,k}^{\left[q\right]}\mid {\mathbf{x}}_{{i}_{j},k}\right),$$(26)where n and m are the true numbers of sources and measurements, respectively, and P _{miss} is the probability of missed detection. A more detailed derivation can be found in Appendix B of [27].
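A brute-force evaluation of the decomposition in equation (26) can be sketched as follows; the function name and default parameter values are hypothetical, and the enumeration over permutations implements the sum over distinct source indices i_1 ≠ … ≠ i_m with a Gaussian single-measurement likelihood.

```python
import math
from itertools import permutations

def true_likelihood(measurements, predicted_tdoas, p_miss=0.25, sigma_v=1e-5):
    """Enumerate every assignment of the m measurements to m distinct
    sources out of the n in the particle, weighted by the
    missed-detection prior P_miss^(n-m) * (1 - P_miss)^m."""
    def gauss_pdf(z, mu):
        return math.exp(-0.5 * ((z - mu) / sigma_v) ** 2) / (
            sigma_v * math.sqrt(2.0 * math.pi))

    m, n = len(measurements), len(predicted_tdoas)
    if m > n:
        return 0.0
    prior = (p_miss ** (n - m)) * ((1.0 - p_miss) ** m)
    total = sum(
        math.prod(gauss_pdf(z, predicted_tdoas[i])
                  for z, i in zip(measurements, idx))
        for idx in permutations(range(n), m))  # i_1 != ... != i_m
    return prior * total
```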
By introducing the speaker identifier with the TDOAs, the likelihood for each source in equation (26) is expanded by the law of total probability as follows:$$g\left({z}_{j,k}^{\left[q\right]}\mid {\mathbf{\xi}}_{k}\right)=\sum _{r=1}^{\stackrel{\u0303}{N}}g\left({z}_{j,k}^{\left[q\right]}\mid {\mathbf{\xi}}_{k},{H}_{r}\right)g\left({H}_{r}\mid {\mathbf{\xi}}_{k}\right),$$(27)where $\stackrel{\u0303}{N}$ is the total number of individual speakers in the scenario, ξ _{ k } is the extended state in equation (23), H _{ r } represents the event that this measurement originated from the rth source, r = 1, …, $\stackrel{\u0303}{N}$, and $g\left({H}_{r}\mid {\mathbf{\xi}}_{k}\right)$ is the transition probability of the speaker identifier, which is introduced here and set according to the speaker recognition accuracy. For the probability of the observation given the hidden variables, $g\left({z}_{j,k}^{\left[q\right]}\mid {\mathbf{\xi}}_{k},{H}_{r}\right)$:

If r = η _{ k }, we obtain:

If r ≠ η _{ k },
Here, the measurement can be associated with a specific source in the particle; thus, the likelihood of η _{ k } is employed in equation (28). When the identifier corresponds to another speaker, the speaker recognition result is used in equation (29), and an intuitive way to compute the likelihood of H _{ r } is to use the normalized log-likelihood ratio in equation (15):$$g\left({z}_{j,k}^{\left[q\right]}\mid {H}_{r}\right)=0.5\cdot \mathrm{tanh}\left(\beta \cdot g\left({\mathbf{o}}_{k}\mid {H}_{r}\right)\right)+0.5,$$(30)where tanh is the hyperbolic tangent function, β is a control parameter, and $g\left({\mathbf{o}}_{k}\mid {H}_{r}\right)$ is the score S _{tar} in equation (15). Generally, in equation (27), the measurement associated with the true source contributes the most when updating the state, which allows us to track different speakers without ambiguity.
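The squashing in equation (30) can be written directly as a one-line function; the function name is hypothetical, and the mapping sends a score of 0 to 0.5 and saturates toward 0 and 1 for strongly negative and positive scores.

```python
import math

def identity_likelihood(score, beta=1.0):
    """Equation (30): map the normalized log-likelihood-ratio score S_tar
    onto (0, 1) with a scaled hyperbolic tangent."""
    return 0.5 * math.tanh(beta * score) + 0.5
```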
3.3.6 Flowchart of proposed tracking approach
The detailed flowchart of the proposed approach (for simplicity, only a single MP is depicted) is given in Figure 4.

Short-time Fourier transform (STFT) is performed on the received signals.

Process each MP with the DUET method in the TF domain and obtain the 2D histogram.

Remove the dimension of LR to form a one-dimensional (1D) histogram.

Determine the number of DUET peaks, cluster the 1D histogram samples using the clustering method, and employ the cluster centers as TDOA measurements.

Select the MPs with the greatest number of measurements (possibly multiple pairs) and calculate the separating degree.

Separate the received signals from the selected MPs.

Extract the MFCC features from the separated signals.

Recognize speakers according to the MFCCs.

For the birth process, according to the SR result of the MP with the greatest separating degree, allow a source not contained in the particle to be born. For the survival of sources, compute the likelihood according to equation (27) to update the source states, and update the particle weights according to equation (24).
Figure 4 Flowchart of proposed approach (Only one microphone pair is depicted). 
Note that the separated signals of all MPs obtained by the DUET method are used for SR, and the number of separated signals, $\widehat{N}$, may differ across MPs. To obtain more consistent speaker identifiers in the birth process, only the identifiers recognized from the MP with the largest separating degree are used. In step (8), for each microphone pair, the weight update uses the measurements together with the corresponding recognized speaker identifiers and scores.
4 Experimental evaluation
4.1 Experimental setup
To evaluate the proposed approach, the IMAGE method [43] was used to generate the room reverberation, where the reverberation levels were set by adjusting the reflection coefficients of the walls. All raw voice fragments were taken from speech recordings downloaded from the Scientific American podcasts [45]. In addition, to obtain clean signals from the specific speakers, background audio, e.g. the speech of non-specific speakers and music, was removed from the original wave files. The resulting data corpus has been uploaded to the Figshare repository [46]. The signals received by the microphones were sampled at 8000 Hz and then framed at two levels. The first-level frame was used to accumulate a sufficient DUET histogram and obtain the TDOA measurements, and its length must be chosen carefully: longer frames are favorable because more credible recovered signals can be obtained and more robust speech features can be extracted, but overly long frames cannot reflect the motion of the sources over time. The second-level frame, taken within the first-level frame, was used to obtain the TF bins for the DUET method and to extract the MFCCs. In the following sections, various lengths and overlaps were considered for the first-level frame, whereas the second-level frame length was uniformly set to 1024 points with a shift of 256 points. For example, with a first-level frame of 4096 points, 13 subframes are used to form the DUET histogram and extract the MFCCs in that first-level frame.
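As a sanity check on the framing arithmetic above, the following sketch (assuming the 256-point figure denotes the frame shift between consecutive 1024-point subframes) reproduces the stated count of 13 subframes per 4096-point first-level frame:

```python
def num_subframes(frame_len, sub_len=1024, hop=256):
    """Number of second-level frames that fit into one first-level frame
    of frame_len samples, with subframe length sub_len and shift hop."""
    return (frame_len - sub_len) // hop + 1
```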
The parameters of the DUET method were set as follows. After converting the 2D histogram to a 1D histogram, the length of the smoothing window was set to three bins in equation (13), and the K-means clustering method was used to cluster the TF samples with the locations of the peaks as the initial points. Note that two additional assumptions are made for the indoor scenarios in this paper.

In one time frame, at most one speaker source can be born.

Only a few speakers are active simultaneously, and the number of speakers is not known in advance.
The first assumption makes the RFS-SMC tracking method easy to implement. The second assumption ensures sufficient performance of the speaker recognition algorithm in short-time scenarios; note that it holds in practical applications. In the following experiments, at most two simultaneously active speakers were considered [13, 27], and thus at most two TDOAs from each MP were collected for subsequent tracking. After the energy-based voice activity detection [31], only the TDOAs whose corresponding DUET peak values exceeded 5% of the height of the highest peak were used. In addition, the weight α was set to 0.4 in equation (14b).
When training the speaker models, i.e., registering speakers, speeches of nine male and eleven female speakers (duration of each fragment: 1–2 min) were used to train the UBM and GMM speaker models, with 32 Gaussian components in each GMM. During tracking, at most four different enrolled speakers were considered, i.e., $\stackrel{\u0303}{N}=4$ in equation (27). When extracting the speech features [44], 36-dimensional MFCCs and their deltas were obtained.
Here, the Langevin process [27] was used to describe the dynamic model. In the survival process, the trust probability P _{BSS} (Sect. 3.3.4) was set to 0.75, and the transition probability in equation (27) was set to 0.3. For the normalized log-likelihood ratio in equation (30), the control parameter β was set to 1. The other main parameters of the baseline RFS-SMC method were set as follows: P _{birth} = 0.05, P _{death} = 0.01, P _{miss} = 0.25 and λ _{ c } = 1. The number of particles for the RFS-SMC was set to 500. Unless stated otherwise, each experiment was run 500 times.
4.2 Performances of blind source separation and speaker recognition methods with short frames
Before tracking the speakers, the performance of the BSS and SR methods with short frames was evaluated. Since separating a single-source signal is unnecessary, the BSS evaluation involved two simultaneous sources. The evaluation of BSS and SR under the anechoic condition with an SNR of 30 dB is described as follows.
In this case, the room size was set to 3 m × 3 m × 3 m (length × width × height), and a total of six microphone pairs with inter-microphone spacing of 4 cm were used to record the signals. This close spacing was adopted to avoid spatial aliasing; note that an implementation of DUET with larger MP spacing can be found in the literature [11]. In addition, a fixed height of 1.5 m was set for all microphones and sources for 2D tracking. The specific MP placement is shown in Figure 5: four MPs are placed near the walls to detect sources distant from the room center, and the two MPs at the center of the room were used to facilitate better separation. The two speakers were stationary at positions (0.5, 0.8) and (2.5, 2.2). The duration of the speeches uttered by these speakers was 15 s.
Figure 5 Placement of six microphone pairs and two speakers in scenario 1. For MP5, the microphones at (1.50, 1.48) and (1.50, 1.52) were paired. For MP6, the microphones at (1.48, 1.50) and (1.52, 1.50) were paired. 
To evaluate the BSS results, the recovered-signal-to-noise ratio (RSNR) of the restored speech was used (in dB).$$\mathrm{RSN}{\mathrm{R}}_{i}=10\mathrm{log}\left(\frac{\sum _{t=1}^{T}{s}_{i}^{2}\left(t\right)}{\sum _{t=1}^{T}{\left({s}_{i}\left(t\right)-{\widehat{s}}_{i}\left(t\right)\right)}^{2}}\right).$$(31)
Here, ${s}_{i}$ and ${\widehat{s}}_{i}$ are the original and restored speech signals, respectively. Each signal was normalized in advance, and the two compared signals were time-aligned prior to calculating the RSNR values.
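Equation (31) can be computed directly as below; the function name is hypothetical, and the inputs are assumed to be already normalized and time-aligned as stated above.

```python
import math

def rsnr_db(s, s_hat):
    """Equation (31): recovered-signal-to-noise ratio in dB between the
    original signal s and the restored signal s_hat."""
    signal_energy = sum(x * x for x in s)
    error_energy = sum((x - y) ** 2 for x, y in zip(s, s_hat))
    return 10.0 * math.log10(signal_energy / error_energy)
```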
When the length of the first-level frame was set to 4096 points with an overlap of 1024 points, the performance of BSS and SR is shown in Table 1. Each row gives the results for one frame; columns two and three show the RSNR values of the two separated signals, column four shows the average RSNR, and columns five and six indicate whether the separated signals were recognized correctly. As can be seen, only a small number of speaker recognition errors occurred despite the low RSNR values.
Performances of BSS and SR (√ denotes correct recognition, and × denotes incorrect recognition).
Because longer frames contain more information about the speaker’s inherent characteristics and allow the BSS method to achieve higher performance, longer frames are preferable for recognizing speakers. To balance this against real-time tracking, after testing various frame lengths and overlaps, a first-level frame length of 4096 points with an overlap of 2048 points was used in the subsequent performance evaluation.
4.3 Evaluation of acoustic source tracking
To evaluate the estimation of the number of sources, the probability of correctly estimating the number of speakers is defined as follows:$$P\left(\left|{\widehat{\mathbf{X}}}_{k}\right|=\left|{\mathbf{X}}_{k}\right|\right).$$(32)
When the number of sources is correct, the conditional mean distance is used to assess the distance error [27] as follows:$${E}_{{\widehat{\mathbf{X}}}_{k}}\left\{d\left({\mathbf{X}}_{k},{\widehat{\mathbf{X}}}_{k}\right)\mid \mathrm{correct\ speaker\ number\ estimate}\right\},$$where$$d\left({\mathbf{X}}_{k},{\widehat{\mathbf{X}}}_{k}\right)=\underset{{j}_{i}\in \left\{1,\cdots ,n\right\},\ {j}_{i}\ne {j}_{l}\ \forall i\ne l}{\mathrm{min}}\sqrt{\frac{1}{n}\sum _{i=1}^{n}{\Vert {\mathbf{x}}_{i,k}-{\widehat{\mathbf{x}}}_{{j}_{i},k}\Vert}^{2}}.$$(33)
Here, the minimization attempts to identify a proper assignment between the estimated positions and the ground truth.
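For the small source counts used here, the minimization in equation (33) can be carried out by exhausting all one-to-one assignments; the following sketch (hypothetical function name, 2-D positions, equal set sizes assumed) illustrates this.

```python
import math
from itertools import permutations

def mean_distance(truth, estimate):
    """Equation (33): RMS position error under the best one-to-one
    assignment of estimated positions to true positions."""
    n = len(truth)
    return min(
        math.sqrt(sum(math.dist(truth[i], estimate[perm[i]]) ** 2
                      for i in range(n)) / n)
        for perm in permutations(range(n)))
```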
In addition, two existing methods were compared as baselines to verify the superiority of the proposed method’s tracking performance.

DUET-RFS-SMC method, which is based on the native RFS-SMC [27] and only uses DUET-based TDOA measurements.

DUET-RBPF method [13], which is constructed in an RFS Bayesian filtering framework and implemented by the Rao-Blackwellized PF method [13].
In the RBPF, the source state is decomposed into the source position and a data association variable. The source position can then be marginalized out using an extended Kalman filter (EKF), and only the association variable needs to be estimated by the PF. In our implementation, at the EKF stage, the position state was initialized at a random position in the room with a velocity of 0 m/s in both directions, the initial variance was set as a diagonal matrix with diagonal elements [1, 0, 1, 0], and the measurement noise variance was set to 5 × 10^{−9}. For fairness, all tracking methods employed DUET to generate the TDOA measurements.
4.3.1 Tracking under anechoic condition
In this subsection, we consider a more complicated scenario, as shown in Figure 6. Speaker 1 moved and spoke continuously from frames 1 to 29; after 10 frames of silence, this source moved and spoke continually for another 34 frames. Meanwhile, speaker 2 moved and spoke from frames 21 to 49; thus, a total of 74 frames was considered. Detailed trajectories are shown in Figure 6.
Figure 6 Placement of microphones and trajectories of two sources in scenario 2. 
Figure 7 shows the estimated trajectories for a single trial under the anechoic condition with an SNR of 30 dB. In this figure, the ground-truth trajectories are marked with green solid lines, and the estimated trajectories of different speakers are represented using different colors and line styles. We found that the TDOA measurements from the DUET method work well for tracking two speakers. As shown in Figure 8, compared to the DUET-RFS-SMC and DUET-RBPF methods, the proposed method achieves better number estimation and better performance in terms of the conditional mean distance.
Figure 7 Estimated tracking trajectories by proposed method for a single trial under anechoic condition, where the red circles represent the microphones. 
Figure 8 Tracking performance of two acoustic sources under anechoic condition. 
4.3.2 Tracking under adverse conditions
The probability of the correct number of sources in equation (32) only assesses the detection of sources, and the well-matched position estimates in equation (33) do not account for over- or underestimation of the number of sources. For further comparisons, the optimal subpattern assignment (OSPA) metric [47] was used to evaluate the number estimation and distance error jointly.
For any finite subsets X = {x _{1}, ⋯, x _{ m }} and Y = {y _{1}, ⋯, y _{ n }} with m ≤ n, the OSPA distance is defined as follows:$${d}_{\mathrm{OSPA}}\left(\mathbf{X},\mathbf{Y}\right)={\left(\frac{1}{n}\left(\underset{\pi \in {\mathrm{\Pi}}_{n}}{\mathrm{min}}\sum _{i=1}^{m}{d}^{\left(\kappa \right)}{\left({\mathbf{x}}_{i},{\mathbf{y}}_{\pi \left(i\right)}\right)}^{p}+{\kappa}^{p}\left(n-m\right)\right)\right)}^{1/p},$$(34)where ${d}^{\left(\kappa \right)}\left(\mathbf{x},\mathbf{y}\right)=\mathrm{min}\left(\kappa ,d\left(\mathbf{x},\mathbf{y}\right)\right)$ is the distance between x and y cut off at $\kappa >0$, 1 ≤ p < ∞, and ∏_{ n } is the set of permutations on {1, ⋯, n} for any n ∈ {1, 2, ⋯}. In the following parts, κ and p are set to 3 and 2, respectively.
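Equation (34) can be evaluated by brute force for the one or two sources tracked here; the following sketch (hypothetical function name, 2-D points) swaps the arguments when |X| > |Y| so that m ≤ n always holds.

```python
import math
from itertools import permutations

def ospa(X, Y, kappa=3.0, p=2.0):
    """OSPA distance of equation (34) for small 2-D point sets,
    minimizing over all injective assignments by enumeration."""
    m, n = len(X), len(Y)
    if m > n:
        return ospa(Y, X, kappa, p)
    if n == 0:
        return 0.0
    best = min(
        sum(min(kappa, math.dist(X[i], Y[perm[i]])) ** p for i in range(m))
        for perm in permutations(range(n), m))
    return ((best + kappa ** p * (n - m)) / n) ** (1.0 / p)
```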
To evaluate the proposed method under adverse conditions, various reverberation levels with the same SNR (30 dB) and different noise levels with the same reverberation time (0.20 s) were considered. Figures 9a and 9b show the OSPA distances under different echoic and noise conditions, respectively. As shown in Figure 9a, the performance of all methods degraded gradually as the reverberation level increased. Because the variance of the importance weights can be reduced in the Rao-Blackwellization step [13], the DUET-RBPF method outperformed the raw RFS-SMC filter (DUET-RFS-SMC) at low reverberation levels; however, as conditions deteriorated, its estimated trajectories began to deviate from the ground truth. Comparatively, after introducing the speaker identifier to the source state, the proposed method outperformed the compared methods under nearly all tested echoic conditions. As shown in Figure 9b, the proposed method also works better than the other methods under adverse noise conditions and demonstrates consistent performance at SNR levels from 20 to 40 dB.
Figure 9 Tracking performance under (a) echoic conditions with the SNR fixed to 30 dB and (b) noisy conditions with the RT60 fixed to 0.20 s. 
4.3.3 Evaluation on sources close to each other
In many scenarios, speakers may move close to each other, making the acquired measurements difficult to discriminate and leading to false measurement–source associations. Several typical association cases in the tracking process are shown in Figure 10, where the green solid lines represent the true source trajectories and the tracks in other colors represent the filtering results. The two true trajectories are parallel lines separated by 1 m. Figure 10a shows correct tracking of the two targets with no crosstalk between the two estimated tracks. Figure 10b shows a false measurement–source association while tracking one source (the blue track list), i.e., one source uses the measurements of the other source to update its state, because the two targets are close and the measurements of the different sources are hard to discriminate. Figure 10c shows an interruption of the tracking (the black and blue track lists), where one track list (black) deviates from the true trajectory. This paper introduces the TDOA measurement together with the corresponding speaker’s inherent character into the tracking process to address these problems.
Figure 10 Tracking demos when two sources move in close range. (a) Accurate tracking. (b) False association occurs while tracking. (c) Tracking list deviates from the true track. 
To reasonably evaluate the tracking performance when sources are in close range, a two-speaker case was considered in which the speakers move in parallel, which ensures that the two targets remain close for a sufficient time. The room size and the placement of the microphones are illustrated in Figure 11. Source 1 and source 2 start moving simultaneously from (1.70, 1.00) and (2.30, 1.00), respectively, and both move for 19 frames at a speed of 0.40 m/s.
Figure 11 Scenario 3: two sources move in close range. 
Since no suitable assessment method exists, we define the revised OSPA distance (OSPAr) to measure the tracking performance for the different sources. In a single experiment, the OSPAr is determined by the following steps:

Extract the center of each tracking list on the x-axis.

According to the distance from its center to source 1 and source 2, assign each track list to the nearer true source, forming two list sets.

Compute the OSPA distance between each list set and its ground-truth trajectory.
The first two steps aim to assign the estimated trajectories to one of the true sources according to spatial position. Although this procedure is only suitable for evaluating the current scenario, it is sufficient to illustrate the superiority of the algorithm when the sources are in close range. In the second step, for one time frame, a source may correspond to zero or more states in the estimated trajectory set. In the last step, the OSPA distance measures the similarity between the two sets. Figure 12 shows the computed distances under RT = 0.20 s and SNR = 30 dB, with the two baseline tracking methods also compared. Because the proposed method uses the speaker identity information associated with each measurement, it can distinguish the sources and track them accurately with the filtering algorithm. Comparatively, the DUET-RFS-SMC and the hidden-variable-estimation-based DUET-RBPF methods perform slightly worse in discriminating the targets at close range. The tracking performance also differs between the two sources because the measurements of source 2 are more accurate.
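The assignment in the first two OSPAr steps can be sketched as follows; the function name and inputs are hypothetical (each track list is summarized by the x-coordinate of its center, and ties go to source 1).

```python
def split_tracks_by_source(track_centers_x, src1_x, src2_x):
    """Assign each estimated track list (indexed by position in
    track_centers_x) to the true source whose x-coordinate is nearer."""
    set1, set2 = [], []
    for i, cx in enumerate(track_centers_x):
        (set1 if abs(cx - src1_x) <= abs(cx - src2_x) else set2).append(i)
    return set1, set2
```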
Figure 12 OSPAr distance when RT = 0.20 s, SNR = 30 dB, and the source spacing equals 60 cm. (a) Proposed. (b) DUET-RFS-SMC. (c) DUET-RBPF. 
Further experiments were conducted with the targets at different spacings; the results are shown in Figure 13. Better performance is achieved at a medium spacing (60 cm). When the spacing between the two sources is too small (40 cm), the mutual interference between the sources is high, and the performance in discriminating the sources degrades.
Figure 13 Comparisons under different source spacings. 
5 Conclusions
This paper uses the DUET method to separate mixed signals and takes two acoustic sources as an example tracking task. In the proposed method, the GMM-UBM model is employed to quickly recognize speakers: the speaker feature is extracted from each frame, and the recognized identifier is utilized to improve tracking performance. Experiments evaluated the feasibility and validity of speaker recognition over short-time frames when one or two speakers appeared, and the results demonstrated superiority in terms of measurement–source association. The proposed method can also be extended to more acoustic sources when more measurements are utilized.
This paper primarily considered the tracking step; better speaker recognition algorithms are beyond its scope, and thus only classic BSS and SR methods were considered. In the future, however, we expect that the proposed method can be improved by employing better BSS or SR methods.
Conflict of interest
The authors declare no conflict of interest.
Funding
This work was supported by the Basic Scientific Research project (JCKY2017207B042), National Natural Science Foundation of China (61673313, 61673317), Science and Technology on Underwater Test and Control Laboratory (6142407190508).
References
 J. Choi, J. Kim, N.S. Kim: Robust time delay estimation for acoustic indoor localization in reverberant environments. IEEE Signal Processing Letters 24, 2 (2017) 226–230. [CrossRef] [Google Scholar]
 M. Jia, J. Sun, C. Bao: Realtime multiple sound source localization and counting using a soundfield microphone. Journal of Ambient Intelligence & Humanized Computing 8, 6 (2017) 829–844. [CrossRef] [Google Scholar]
 B. LauferGoldshtein, R. Talmon, S. Gannot: A hybrid approach for speaker tracking based on TDOA and datadriven models. IEEE Transactions on Audio, Speech, & Language Processing 26, 4 (2018) 725–735. [CrossRef] [Google Scholar]
 S. Argentieri, P. Danes, P. Souères: A survey on sound source localization in robotics: From binaural to array processing methods. Computer Speech & Language 34, 1 (2015) 87–112. [CrossRef] [Google Scholar]
 Q. Zhang, Z. Chen, F. Yin: Speaker tracking based on distributed particle filter in distributed microphone networks. IEEE Transactions on Systems Man & Cybernetics Systems 47, 9 (2017) 2433–2443. [Google Scholar]
 G. Jiang, Y. Liu, Q. Kong, W. Hao, L. An: Study on acoustic time delay localization in boiler tube arrays considering effective sound velocity. Applied Acoustics 171 (2021) 107680. [CrossRef] [Google Scholar]
 E.A. King, A. Tatoglu, D. Iglesias, A. Matriss: Audiovisual based nonlineofsight sound source localization: A feasibility study. Applied Acoustics 171 (2021) 107674. [CrossRef] [Google Scholar]
 K. Weisberg, B. LauferGoldshtein, S. Gannot: Simultaneous tracking and separation of multiple sources using factor graph model. IEEE/ACM Transactions on Audio, Speech, & Language Processing 28 (2020) 2848–2864. [CrossRef] [Google Scholar]
 C.H. Knapp, G.C. Carter: The generalized correlation method for estimation of time delay. IEEE Transactions on Acoustics, Speech, & Signal Processing 24, 4 (1976) 320–327. [CrossRef] [Google Scholar]
 L. Wang, T.K. Hon, J. Reiss, A. Cavallaro: An iterative approach to source counting and localization using two distant microphones. IEEE/ACM Transactions on Audio, Speech, & Language Processing 24, 6 (2016) 1079–1093. [CrossRef] [Google Scholar]
 S. Makino, T.W. Lee, H. Sawada: Blind speech separation. Springer, Netherlands (2007). [CrossRef] [Google Scholar]
 S. Rickard, O. Yilmaz: On the approximate Wdisjoint orthogonality of speech. IEEE International Conference on Acoustics, Speech, and Signal Procesing. Orlando, USA (2002) 529–532. [Google Scholar]
 X. Zhong, J.R. Hopgood: A timefrequency masking based random finite set particle filtering method for multiple acoustic source detection and tracking. IEEE Transactions on Audio, Speech, & Language Processing 23, 12 (2015) 2356–2370. [CrossRef] [Google Scholar]
 A. Griffin, A. Alexandridis, D. Pavlidi: Localizing multiple audio sources in a wireless acoustic sensor network. Signal Processing 107 (2015) 54–67. [CrossRef] [Google Scholar]
 W. Kai, R.V. Gopalan, A.W.H. Khong: Multisource DOA estimation in a reverberant environment using a single acoustic vector sensor. IEEE Transactions on Acoustics, Speech, and Signal Processing 26, 10 (2018) 1848–1859. [Google Scholar]
 W.H. Foy: Positionlocation solutions by Taylorseries estimation. IEEE Transactions on Aerospace & Electronic Systems. AES12 2 (1976) 189–194. [Google Scholar]
 H. Schau, A. Robinson: Passive source location employing spherical surfaces from timeofarrival differences. IEEE Transactions on Acoustics, Speech, & Signal Processing 35, 8 (1987) 1223–1225. [CrossRef] [Google Scholar]
 J.O. Smith, J.S. Abel: Closeform leastsquares source location estimation from rangedifference measurements. IEEE Transactions on Audio, Speech, & Language Processing 35, 12 (1987) 1661–1669. [Google Scholar]
 Y.T. Chan, K.C. Ho: A simple and efficient estimator for hyperbolic location. IEEE Transactions on Signal Processing 42, 8 (2002) 1905–1915. [Google Scholar]
 Y. Huang, J. Benesty, , GW Elko: Passive acoustic source localization for video camera steering. In: IEEE International Conference on Acoustics, Speech, & Signal Processing (2002) 909–912. [Google Scholar]
 Y. Huang, J. Benesty, GW Elko: An efficient linearcorrection leastsquares approach to source localization. In: IEEE Workshop on the Applications of Signal Processing to Audio and Acoustics (2001) 67–70. [Google Scholar]
 Y. Huang, J. Benesty, G.W. Elko: Realtime passive source localization: a practical linearcorrection leastsquares approach. IEEE Transactions on Speech Audio Process 9, 8 (2001) 943–956. [CrossRef] [Google Scholar]
 K. Yang, J. An, X. Bu: Constrained total leastsquares location algorithm using timedifferenceofarrival measurements. IEEE Transactions on Vehicular Technology 59, 3 (2010) 1558–1562. [CrossRef] [Google Scholar]
 Y. Weng, W. Xiao, L. Xie: Total least squares method for robust source localization in sensor networks using TDOA measurements. International Journal of Distributed Sensor Networks 7, 1 (2011) 1063–1067. [Google Scholar]
 D.B. Ward, E.A. Lehmann, R.C. Williamson: Particle filtering algorithms for tracking an acoustic source in a reverberant environment. IEEE Transactions on Speech & Audio Processing 11, 6 (2003) 826–836. [CrossRef] [Google Scholar]
 M. Crocco, S. Martelli, A. Trucco, A. Zunino, V. Murino: Audio tracking in noisy environments by acoustic map and spectral signature. IEEE Transactions on Cybernetics 48, 5 (2018) 1619–1632. [CrossRef] [PubMed] [Google Scholar]
 W.K. Ma, B.N. Vo, S.S. Singh, A. Baddeley:, Tracking an unknown timevarying number of speakers using TDOA measurements: a random finite set approach. IEEE Transactions on Signal Processing 54, 9 (2006) 3291–3304. [CrossRef] [Google Scholar]
 L. Sun, Q. Cheng, Indoor multiple sound source localization using a novel data selection scheme. In: 48th Conference Information Sciences and Systems (2014) 1–6. [Google Scholar]
 A. Alexandridis, A. Mouchtaris: Multiple sound source location estimation in wireless acoustic sensor networks using DOA estimates: The dataassociation problem. IEEE Transactions on Acoustics, Speech, & Signal Processing 26, 2 (2018) 342–356. [Google Scholar]
 Y. Guo, H.Y. Zhu, X.D. Dang: Tracking multiple acoustic sources by adaptive fusion of TDOAs across microphone pairs. Digital Signal Processing 106 (2020), 102853. [CrossRef] [Google Scholar]
 M.F. Fallon, S. Godsill: Acoustic source localization and tracking using track before detect. IEEE Transactions on Audio, Speech, & Language Processing 18, 6 (2010) 1228–1242. [CrossRef] [Google Scholar]
 A. MasnadiShirazi, B.D. Rao: An ICASCTPHD filter approach for tracking and separation of unknown timevarying number of sources. IEEE Transactions on Audio, Speech, & Language Processing 21, 4 (2013) 828–841. [CrossRef] [Google Scholar]
 M. Taseska, E.A.P. Habets: Blind source separation of moving sources using sparsitybased source detection and tracking. IEEE Transactions on Audio, Speech, & Language Processing 18, 6 (2018) 657–670. [CrossRef] [Google Scholar]
 X.H. Zhong: A Bayesian framework for multiple acoustic source tracking. PhD Thesis, University of Edinburgh, 2010. [Google Scholar]
 T. May, S. van de Par, A. Kohlrausch: A binaural scene analyzer for joint localization and recognition of speakers in the presence of interfering noise sources and reverberation. IEEE Transactions on Audio, Speech, & Language Processing 20, 7 (2012) 2016–2030. [CrossRef] [Google Scholar]
 K. Youssef, K. Itoyama, K. Yoshii: Simultaneous identification and localization of still and mobile speakers based on binaural robot audition. Journal of Robotics & Mechatronics 29, 1 (2017) 59–71. [CrossRef] [Google Scholar]
 F. Vesperini, P. Vecchiotti, E. Principi, S. Squartini, F. Piazza: Localizing speakers in multiple rooms by using deep neural networks. Computer Speech & Language 49 (2018) 83–106. [CrossRef] [Google Scholar]
 A.V. Oppenheim, R.W. Schafer: Discrete-Time Signal Processing. Pearson International (2013). [Google Scholar]
 L. Sun, Q. Cheng: Indoor multiple sound source localization using a novel data selection scheme. In: 48th Conference on Information Sciences and Systems (2014) 1–6. [Google Scholar]
 O. Yilmaz, S. Rickard: Blind separation of speech mixtures via time-frequency masking. IEEE Transactions on Signal Processing 52, 7 (2004) 1830–1847. [CrossRef] [Google Scholar]
 K. Wu, V.G. Reju, A.W.H. Khong: Multisource DOA estimation in a reverberant environment using a single acoustic vector sensor. IEEE/ACM Transactions on Audio, Speech, & Language Processing 26, 10 (2018) 1848–1859. [CrossRef] [Google Scholar]
 D.A. Reynolds, T.F. Quatieri, R.B. Dunn: Speaker verification using adapted Gaussian mixture models. Digital Signal Processing 10 (2000) 19–41. [CrossRef] [Google Scholar]
 J.B. Allen, D.A. Berkley: Image method for efficiently simulating smallroom acoustics. Journal of the Acoustical Society of America 65, 4 (1979) 943–950. [Google Scholar]
 S.O. Sadjadi, M. Slaney, L. Heck: MSR identity toolbox v1.0: A MATLAB toolbox for speaker recognition research. Microsoft Research Technical Report, 2013. [Google Scholar]
 [Online]. Available: http://www.scientificamerican.com/podcast. [Google Scholar]
 Y. Guo: Voice data for speaker recognition. figshare (2022). https://doi.org/10.6084/m9.figshare.21388251.v3. [Google Scholar]
 D. Schuhmacher, B.T. Vo, B.N. Vo: A consistent metric for performance evaluation of multi-object filters. IEEE Transactions on Signal Processing 56, 8 (2008) 3447–3457. [CrossRef] [Google Scholar]
Cite this article as: Guo Y. & Zhu H. 2023. Joint short-time speaker recognition and tracking using sparsity-based source detection. Acta Acustica, 7, 10.
All Tables
Performances of BSS and SR (√ denotes correct recognition, × denotes incorrect recognition).
All Figures
Figure 1 Flowchart of TDOA generation and DUET BSS (only one microphone pair is depicted).
Figure 2 (a) Two speakers appear near the same hyperbola of MP 1 and (b) the corresponding accumulated DUET histogram.
Figure 3 DUET histograms formed by four microphone pairs.
Figure 4 Flowchart of the proposed approach (only one microphone pair is depicted).
Figure 5 Placement of six microphone pairs and two speakers in scenario 1. For MP5, the microphones at (1.50, 1.48) and (1.50, 1.52) were paired. For MP6, the microphones at (1.48, 1.50) and (1.52, 1.50) were paired.
Figure 6 Placement of microphones and trajectories of two sources in scenario 2.
Figure 7 Estimated tracking trajectories of the proposed method for a single trial under the anechoic condition, where the red circles represent the microphones.
Figure 8 Tracking performance for two acoustic sources under the anechoic condition.
Figure 9 Tracking performance under (a) the echoic condition with the SNR fixed at 30 dB and (b) the noisy condition with the RT60 fixed at 0.20 s.
Figure 10 Tracking demos when two sources move in close range. (a) Accurate tracking. (b) False association occurs while tracking. (c) The track list deviates from the true track.
Figure 11 Scenario 3: two sources move in close range.
Figure 12 OSPA distance when RT60 = 0.20 s, SNR = 30 dB, and the source spacing equals 60 cm. (a) Proposed. (b) DUET-RFS-SMC. (c) DUET-RBPF.
Figure 13 Comparisons under different source spacings.