Particle-ﬁ lter tracking of sounds for frequency-independent 3D audio rendering from distributed B-format recordings

– Six-Degree-of-Freedom (6DoF) audio rendering interactively synthesizes spatial audio signals for a variable listener perspective based on surround recordings taken at multiple perspectives distributed across the listening area in the acoustic scene. Methods that rely on recording-implicit directional information and inter-polate the listener perspective without the attempt of localizing and extracting sounds often yield high audio quality, but are limited in spatial de ﬁ nition. Methods that perform sound localization, extraction, and rendering typically operate in the time-frequency domain and risk introducing artifacts such as musical noise. We pro-pose to take advantage of the rich spatial information recorded in the broadband time-domain signals of the multitude of distributed ﬁ rst-order (B-format) recording perspectives. Broadband time-variant signal extraction retrieving direct signals and leaving residuals to approximate diffuse and spacious sounds is less of a quality risk, and likewise is the broadband re-encoding to enhance spatial de ﬁ nition of both signal types. To detect and track direct sound objects in this process, we combine the directional data recorded at the single perspectives into a volumetric multi-perspective activity map for particle-ﬁ lter tracking. Our technical and perceptual evaluation con ﬁ rms that this kind of processing enhances the otherwise limited spatial de ﬁ nition of direct-sound objects of other broadband but signal-independent virtual loudspeaker object (VLO) or Vector-Based Intensity Panning (VBIP) interpolation approaches.


Introduction
The interactive rendering of recorded auditory scenes as virtual listening environments requires an approach to allow six Degrees of Freedom (6DoF) of movement for a variable listener perspective. The variable-perspective rendering of auditory scenes requires interpolation between static recording perspective positions. In existing research, this concept is often referred to as scene navigation or also scene walk-through. This contribution mainly refers to firstorder tetrahedral microphone arrays as means for recording surround audio for high fidelity applications.
While volumetrically navigable 6DoF recording and rendering are theoretically feasible, practical distributions of multiple static 3D audio recordings typically consider capturing perspective changes along the horizontal dimensions to enable walkable rendering of the auditory scene.
Perspective extrapolation of a single perspective for a shifted listening position has been considered in the Spa-MoS (spatially modified synthesis) method by Pihlajamäki and Pulkki [1,2] that estimates time-frequency-domain source positions by projecting directional signal detections of DirAC (directional audio coding [3,4]) onto a pre-defined convex hull (e.g. the room walls). This method is assumed to be accurate in spatial reproduction when the parallactic shift with regard to the original perspective recording stays small. Similarly, Plinge et al. [5] utilize DirAC in combination with known distance information to rotate and attenuate sources in a single-perspective recording to extrapolate its perspective. This approach would be expandable by multiple directional signals obtained via HARPEX by Barrett and Berge [6] for first-order signals, as Stein and Goodwin suggest in [7]. Higher-order Ambisonics signals are used in a work by Allen and Kleijn [8] that employs a matching-pursuit algorithm for multi-directional signal decomposition, takes into account estimated source distances, and labels reflections and direct sounds. Kentgens et al. [9] apply alternative multi-directional decomposition, e.g., the subspace methods SORTE/MUSIC for extraction and shifted re-encoding of direct components, complemented by a noise and diffuseness subspace that preserves ambient sounds. Birnie et al. [10] introduce a sound field translation method for 6DoF binaural rendering based on sparse plane-wave expansion for near and far sources arranged on two rings around the higher-order recording perspective. Altogether, the single-perspective extrapolation approaches require either parametric time-frequency processing or higher order microphone arrays to achieve directional definition, and their extrapolation range depends on how successful distance information is guessed or estimated. Alternatively, Bates and O'Dwyer [11], Lee et al. [12] employ a more classical, spaced array augmented by controllable-directivity microphones to simulate an extrapolated listening perspective.
Multiple perspectives contain additional information needed for explicit acoustic source localization that enlarges the supported range of shifted listening perspectives with high spatial definition. Brutti et al. [13,14], Hack [15] or Del Galdo et al. [16,17] introduce object localization methods using maps of the acoustic activity to localize or triangulate the sources within a scene. In [13][14][15], sequential peak picking algorithms are proposed to avoid erroneous detections in correlation-and intensity-based triangulation respectively. The virtual microphone method [16,17] utilizes the detected location every short-time frequency bin to assemble a virtual microphone signal of an arbitrary parametrically rendered directivity pattern, what in principle would allow parametric 6DoF rendering. Zheng [18] combines sound object detection with time-frequencydomain signal separation to extract direct signals for sound field navigation. Particle filters to track sound sources detected in reverberant environments were described by Ward et al. [19], and Fallon et al. [20] introduce a particle filter-based acoustic source tracking algorithm for a timevarying number of sources. Probabilistic detection was combined with particle filters for directional tracking of sounds was introduced by Valin et al. [21,22] for robotic applications, which was later adapted by Kitić and Guérin [23] to a first-order Ambisonics application.
Multi-perspective recordings also permit perspective interpolation methods with information about source locations staying implicit. Tylka and Choueiri [24][25][26] proposed interpolatory spherical-harmonic re-expansion at low frequencies for multi-perspective firstor higher-order Ambisonic recordings. Similar to the simplified treatment of high-frequencies in Tylka et al., there is also a number of simplistic broadband signal interpolation methods working in the time domain, only. Such methods avoid any risk of introducing musical noise artifacts, which can happen in time-frequency-domain processing. Mariette et al. [27] mix the first-order Ambisonics signals of the three nearest recording positions proportional to their proximity, similar to the proposal of Schörkhuber et al. [28] that introduces additional amplitude-dependent gains. Patricio et al. [29] propose a distance-based linear interpolation between higher-order Ambisonic recordings in the time domain, in which the higher-order content of distant microphones is faded out, while proximal recording perspectives remain unaltered. Multiple perspective recordings are mapped to multiple surround playback rings of virtual loudspeaker objects (VLOs) by Grosche et al. [30] and Zotter et al. [31], of which the direction and amplitude (involving a distance and directivity function) vary with the listener position and are employed in higher-order re-encoding of the VLO signals. A good introduction to scene recording and sound field interpolation especially in multimedia VR applications is given by Rivas-Mendez et al. [32].
For simplicity, simulations and experiments in this contribution deal with an equidistant grid of recording perspectives volumetrically distributed within a homogeneous auditory scene, however the approach introduced is more general. We introduce a multi-perspective interpolation method that merges and extends detection/tracking and broad-band signal processing concepts found in literature. A broadband signal extraction and rendering method is utilized for artifact-free signal processing in combination with automatic signal detection and position estimation for higher spatial accuracy. The estimated position of any detected object is used to steer broadband beamformers at the nearest recording positions to capture the object's direct sound. Weighted and delay-compensated combinations of the extracted signals yield approximated direct signals, while residual signals with direct-sound directions suppressed aim to reintroduce enveloping components of the diffuse sound field. Signal extraction and encoding procedures are described in Section 2 and scene analysis procedures in Section 3. Detection accuracy is technically investigated in Section 4 under varying SNR conditions. To assess the performance and achievable improvement of the proposed algorithm applied to a simple acoustic scene recording with static objects, a two-part listening experiment compares the rendering method with two existing broadband 6DoF rendering methods in Section 5, for a static and a moving listener.

Frequency-independent 6DoF Rendering
Given the listener position, the microphone array positions, and assuming to know the sound source positions, we can compute the signals customized to the acoustic perspective of a single virtual listener, delivered as a stream of higher-order Ambisonics [33] signals. This delivery format features the benefits of modular decoders facilitating playback on headphones or loudspeaker layouts and the Ambisonic sound field rotation operator. In case there are multiple virtual listeners, the listening-position-dependent processing steps are carried out interactively and separately for each listener, excluding the sound-scene analysis steps that are pre-computed offline.

Signal encoding and decoding
The multi-channel Ambisonics signal v(t) of the listener perspective of the order N is computed by multiplication of a single-channel signal with the weights of an encoder y N (h). Such an encoder consists of the N3D-normalized spherical harmonics y N ðhÞ T ¼ ½Y 0 0 ðhÞ; Y À1 1 ðhÞ; . . . ; Y 1 1 ðhÞ; . . . ; Y N N ðhÞ evaluated at the unit-length direction vector h = [cosu sin0, sinu sin0, cos0]. The theoretical background of spherical harmonics and the concept of Ambisonics can be found in literature, e.g. [33], and practical implementation of encoders and decoders alike is easily accomplished with libraries such as introduced by [34]. 1 We will encode the listening-position-dependent object direct signals (cf. Sect. 2.2) and the residual signals (cf. Sect. 2.3), depending on the relative direction vector h to the listener. For S signals in total, the signals s i (t), their gains g i , and the instantaneous direction h i = h i (t) of each signal, encoding is defined as Rotation of the Ambisonics perspective is necessary with headphone playback (Fig. 1). It enables taking into account the listener's head rotation to simulate a static acoustic scene outside the headphones. This is done by applying a (N + 1) 2 Â (N + 1) 2 rotation matrix R(u, 0, c) (cf. [35,33]) to the Ambisonics signal To render the headphone signals from the variable-rotation Ambisonics signalṽ, the MagLS binaural decoder is used, as described in [36]. With the exception of this binaural decoder, no frequency-dependent signal processing is introduced by the signal processing proposed. The upcoming sections introduce the methods to extract object direct signals and residual signals from the multi-perspective recordings, and they define gain and direction values for the encoding step in (1).

Object direct signal extraction
The object direct signals are approximations of the signal emitted from audible objects in the scene arriving at the virtual listener. Their positions are assumed to be known here; Section 3 thereafter introduces position estimation. Figure 2 shows the positions of three microphone arrays, an object position, and the listener position, which are all required to approximate the direct signal of the object as a weighted sum. In general, for each known or estimated sound object with position inside the scene, a simplex of surrounding microphone arrays is selected to define the closest set of surrounding recordings taken. The signals at its vertices are likely to capture the cleanest instances of the object's direct sound. Initial weights for the signals at these vertices are obtained from the area coordinates, the barycentric coordinates of a point in the triangle (Fig. 2) as the typical simplex when recording positions are distributed horizontally, cf. (7); or within a tetrahedron if recordings were distributed volumetrically. But first, the microphone arrays at any of these position are a spherical constellation of transducers, and h m,j denotes the transducer direction vectors. In our case, the index range is j = 1 . . . 4, as the arrays considered are tetrahedral Oktava MK 4012 [37]. The tetrahedral cardioid microphone signals of each recording perspective are encoded into N3D-normalized first-order Ambisonics by assuming the jth recorded transducer signals of the perspective p is denoted as s p,j (t). From this, a signal is extracted by applying beamforming vector g that steers towards the object position, here denoted as Àh op,i , which is illustrated in Figure 2. This beamforming vector is  modelled as on-axis-normalized maximum-directivity (in first order: hypercardioid) and can be computed as where y 1 is the encoder of first order in the negative object-perspective direction Àh op,i . To combine the three beamforming-extracted signals of an object from the triplet i = 1 . . . 3 of into a single direct signal, a combined gain is defined for the object s and perspective i Herein, the gain g dir,s,i denotes the areal or barycentric coordinate weight, cf. [38]. Assuming a projected sound object positionx 2D;s , it favors the closest perspective from the given triplet of projected positions p 2D,1 , p 2D,2 , p 2D,3 and yields g tri;s;2 g tri;s;3 The remaining value is computed by g tri;s;1 ¼ 1 À ðg tri;s;2 þ g tri;s;3 Þ: ð9Þ The factor g dir,s,i quantifies the alignment of the objectperspective directions h op,i with the object-listener direction h ol,s . Its task is to favor perspectives that are directionally aligned with the direction of radiation from the source to the listener, and it hereby favors the most suitable surrounding perspectives in terms of source directivity. It is formalized as cardioid of the order a. Again, the vectors h op,i and h ol,s are unit vectors and illustrated by Figure 2. The signal encoding as introduced with (1) requires direction vector, signal and gain. The object-listener direction h ol,s is employed to compute the encoder (Sect. 2.1) for the approximated object direct signal, which is the combination of the areal coordinate gain-weighted (6) and delay-compensated extracted signals (4) Here, delay compensation is done based on the speed of sound c and distance differences Át s,u = c À1 (d ol,s Àd op,s,i ), where the distances d ol,s and d op,s,i denote the object-listener and object-perspective distances for the object s and perspective i = 1 . . . 3 in the triplet, respectively, see Figure 2.
To model a realistic distance-dependent amplitude attenuation of the signals within the Ambisonic listener perspective the areal coordinate gains are first multiplied by the object-perspective to object-listener distance ratio. Then the combination is the gain that is employed in (1) and depends on the distance between listener and source. It is limited to a maximum of 4 (+12 dB) to make avoid excessive boosts whenever d ol,s becomes small.

Object direct signal suppression (residual signals)
In the optimal case, the approximated object direct signals (11) exclude all room information such as early reflections and late reverberation. They provide a clean signal for accurate directional perception, however do not convey a realistic room impression to the listener. To this end, the residual signals are introduced. A similar concept of direct signal suppression, despite in the higher-order Ambisonics domain, was employed in [9] to extract ambient components.
Here, the concept of a residual signal is implemented in terms of the virtual loudspeaker object (VLO) approach [30,31] that is illustrated in Figure 3a. Each perspective p p holds a number of microphones of the directions h m,j . Each direction and microphone signal is represented by a VLO that is positioned at a finite distance R from the perspective, p p + Rh m,j . The normalized direction vector from the listener position to the virtual loudspeaker objects as well as the corresponding distances are computed as and respectively. As described in [30,31], the VLO gains should depend on the distance and in direction parametrically (from unity to cardioid) Here, the gain (15) attenuates distant VLOs by 1 r . Too close ones are attenuated by r to maintain a robust result and erroneous localization. The gain (16) ensures that listeners do not hear sound far behind a VLO, which is always onaxis oriented towards p p , but not abruptly so when walking through the VLO position. And the direction / p,i in (13) represents the parallactic displacement at a shifted listening position. The VLO approach achieves an enveloping and spatially plausible reproduction when used with multi-perspective microphone arrays distributed in the recorded scene. There is potential for improvement in the spatial definition of its direct-sound imaging.
For this work, the VLO method is now modified to serve as a residual-signal renderer complementing the objects direct sound signals. Here, the method encodes the 4P residual signals to the listener perspective using (13) as directions in (1). It is however necessary to exclude the direct signals from the VLO signals so the spatial accuracy gained by object signal encoding (Sect. 2.2) is not diminished. To this end, de-emphasis is applied to each microphone-array perspective with the goal of suppressing direct signals that belong to identified sound objects. This is done by gains depending on the object-perspective directions and the object's direct signal amplitude applied to the signals s p,i (t) for all perspectives p = 1 . . . P and transducers i = 1 . . . 4. Extending the gains (15) and (16) Figure 3b illustrates the concept of this de-emphasis term suppressing direction towards the objects, which is the product of S directional gain patterns. Each of them is a mixture of unity and cardioid directivity, oriented such that the cardioid suppresses the object signals. It is defined for the array transducer directions h m,i as with the single-object de-emphasis patterñ for which the exponent b controls the width of the directional notch and a p,s (t) 2 [0,1] permits to control the depth of the notch, or to release it for distant or quiet sound-object signals. For this purpose, a p,s (t) is defined depending on the object-perspective distance and moving RMS value of " s s ðtÞ (11) as and g dis (d op,p , R de ) from 15 using reference distance R de . The value of the threshold RMS s for each sound object signal is determined by pre-computation given the recordings, or it can be defined manually.
3 Estimation of object positions Figure 4 provides an overview of the procedure to estimate the sound object positions necessary for the rendering algorithm as introduced in previous sections. Given the frequency-domain microphone array surround signals, the direction-of-arrivals (DOAs) of single-frequency components are estimated and combined into DOA maps. This is similar in concept and application as the DOA histograms in [15,18] and explained in Section 3.1. Section 3.2 introduces the method to intersect the directional information, to compute combined values, described as the acoustic activity map. Together with the subsequent sequential peak picking algorithm (Sect. 3.3), these concepts are also discussed in [15,13,14]. After selection of the instantaneous set of peaks by a sequential algorithm, these are evaluated in terms of probabilistic measures for sound object emergence, continuous activity and false detections. Section 3.7 explains the computation of these measures. They inform the decision making regarding the instantiation of particle filters for peak tracking. We use these particle filters, a well known Monte-Carlo probabilistic estimation technique and application-specifically described in Section 3.5, to track the three-dimensional position of the acoustic activity peaks as a time-coherent trajectory, expanding the method introduced in [22,23].

Single-perspective DOA map
Direction-of-arrival estimation is applied frame-wise to the surround microphone array signals. The applied approach is the magnitude sensor response method, introduced in [39]. In [40] it was extended to the smoothed magnitude sensor response. It is applied in the frequency domain and done by frequency-wise computation of the covariance matrix R (k) of the frequency-domain signal data S (k) for bins k = 1 . . . K at the time instant m as an average over M time frames of the microphone array frequency bin magnitudes S (k) . A subsequent eigenvalue decomposition gives us the possibility to further decompose it into signal and noise subspace. This is done by selecting L eigen values k ðkÞ l per bin k. As used in [40], this can be a fixed number L or be variable and computed with methods such as SORTE introduced in [41]. The case of a first-order surround signal allows the estimation of a maximum of two directions per frequency bin as it is the case with HARPEX [6]. The eigenvectors u ðkÞ Ãl , the columns of the left eigenvector matrix U (k) corresponding to the selected L eigenvalues, span the signal subspace. These vectors give the estimation of the DOAŝ for each frequency bin k.
In a similar application, [15] uses histograms accumulating DOA estimates. This concept is evolved into a nondiscrete map in the spherical harmonics domain. The eigenvalues k ðkÞ l and the corresponding DOAsĥ ðkÞ l of all frequency bins are aggregated into a broadband singleperspective DOA map D that is represented by real-valued spherical harmonics (cf. [33]) of the order N The frequency-and eigenvalue-dependent function is used to limit the frequency range and compress large ranges of eigenvalues. DOA maps according to (26) are continuous and have implicit smoothing that depends on the order, what permits interpolated evaluation at arbitrary directions. This is necessary for the intersection of the single-perspective DOA maps, sampled on a three-dimensional grid.

Multi-perspective acoustic activity map
The computation of three-dimensional data from the single-perspective DOA maps is based on work in [13][14][15] and further involves spherical harmonic representation/ encoding as in Section 3.1 as well as decoding to interpolate for the subsequent computations. The continuous maps (26) obtained allow the computation of values for any direction, and in extension: position. Despite this multi-perspective fusion of DOA maps to sound object position implies somewhat omnidirectional sources, at least statistically, the rendering (Sect. 2) takes source directivity into account, as far as observable from the sparse recording perspectives. It does so in the position-dependent signal extraction (Sect. (10)) by favoring the perspectives best aligned with the radiation directions to the listener.

and
We introduce a set of positions s i with i = 1 . . . G, which can be regarded to be arbitrary for now and will be used in two different ways below. Then we define the unit directions that sample the directions from each recording perspective position p p to every position s i in the set. The corresponding directions discretize the DOA map of any perspective by evaluating a spherical harmonic interpolation matrix and hereby discretizing the single-perspective DOA map to display DOA activity for the positions s i These discrete DOA activities are subsequently weighted by a distance factor to give emphasis to perspectives close to the position s i , This function is applied element wise to w p , and d d = 2 is the decay factor chosen here. Next, the distance-weighted single-perspective values are combined by applying an l-norm The term w p [i] denotes the i-th element of the vector. This yields one value c i for each position s i that will be subsequently called the acoustic activity for the positions in question. The acoustic activities {c i } for an entire set of positions {s i } are called acoustic activity map, below (Fig. 5).

Peak detection algorithm
We apply peak detection to the acoustic activity map at every time frame, yielding a set of Q peak observations O = [o 1 . . . o Q ] with the corresponding acoustic activity peak values C q with q = 1. . .Q.
The peak detection algorithm is a greedy sequential algorithm similar to [15], however taking advantage of the continuous spherical harmonic DOA maps. We define the set of nodes s i (cf. Sect. 2) on an equidistant threedimensional grid with i = 1 . . . G positions and a grid spacing of d s , here chosen to be 0.25 m. The nodes are used to evaluate the acoustic activity following the procedure of equations (28)- (32). The maximum of these grid values is selected. It being the global maximum of the acoustic activity, it likely is the loudest sound object in the scene.
The position of the peak, the observation o q , is defined as the grid position corresponding to that maximum To detect local maxima representing other sound objects and at the same time avoid ghost peaks, we apply peak deletion on the DOA maps after each iteration of peak detection. This removal of peaks is similar to approaches [42,43] for directional loudness editing in Ambisonics or generalized spherical beamforming [44][45][46]. A particularly suitable suppression function is denoted as and applied to each single-perspective DOA map D p . The function zeros the spherical harmonic representation at the direction h 0 , which is the perspective-to-peak direction This removes an on-axis normalized, order-weighted beam pattern using that is directed towards the peak to be erased. The order weights a are essentially arbitrary, but maxRE-weights [33] of an order slightly smaller than the one of the DOA map proved to work well. This deletion step is done before continuing to detect the next loudest peak, as it minimizes the likelihood of an erroneous detection of ghost peaks associated with sub-optimal ray intersections belonging to the previous peak, cf. Figure 6a. A more elaborate explanation of the deletion function and order weights can be found in [47]. The sequential peak picking is repeated until either a defined number Q of observations is reached or a peak value threshold is reached. The step sequence is presented in Algorithm 2.
while Peak magnitude is above threshold and maximum number of observations not reached do They are most likely noisy and therefore not directly useable for position estimation of the salient sound objects. To compute time-coherent trajectories for this position estimation, particle filters are introduced. Extending the concepts introduced in [21][22][23], particle filters are used in conjunction with a probabilistic detection algorithm, which evaluates the aforementioned peak observations using transitional probabilities. This evaluation involves a procedure to (a) start tracking a detected peak, (b) continue tracking a peak and (c) stop tracking a peak. Each peak considered valuable is associated with its own particle filter instance. Details on such an instance are found in Section 3.5. For all observations and known objects the procedure below is considered, starting at the most prominent peak: (a) The lifecycle of a tracking instance starts when the value P ðmÞ q ðH new Þ 2 0; 1 ½ for the peak observation o q exceeds a threshold of 0.7. Is this the case, then a particle filter for this peak is initialized at the time instant m. (b) The continuation of a currently tracked peak is determined by the value P ðmÞ s 2 0; 1 ½ . If it exceeds a threshold of 0.6, then the sound object is deemed existent, active, and observable, and a location estimation is computed for the current time instant m. After first creation of an instance according to (a), the threshold must be exceeded for longer than 0.1 s for the estimation to be considered viable, to avoid spurious detection. (c) Complementing (b): if P ðmÞ s falls below the threshold of 0.6 for !0.6 s, the peak is deemed vanished and the instance is discontinued.
The procedure to compute P ðmÞ q ðH new Þ and P ðmÞ s is explained in Section 3.7, but it is valuable to describe the particle filter as applied in this work and its caveats in computation before.

Particle filter dynamics
A particle filter is an estimation method employing a swarm of particles to estimate statistical measures. Here, it is used to estimate the continuous, true position of a peak in the activity map from the statistical centroid of the particle-swarm positions. The particle swarm random-samples the map around the true peak that it is directed to follow, and each particle has its own inertia and momentum. The concept of particle filters has been applied in estimation problems in audio applications such as in [19,22,23]. A non-exhaustive list of literature on particle filter theory is [48][49][50][51][52][53][54][55].
Since the peak observations O introduced in Section 3.3 are not time-coherent and their positional accuracy is Algorithm 2: Step sequence of the peak picking and deletion algorithm. Intersection of directional information can lead to ghost peaks in three-dimensional data. DOA map before peak deletion. Maximum at (À45, À40). Renormalized DOA map after peak deletion. Maximum at (45, 0). determined by the grid spacing, the particle filtering approach is used to obtain a consistent and continuous peak position estimation over time.
When a new peak is observed, as described in Section 3.4, an instance of a particle filter is initialized. This is done by sampling N = 100 positions following a multivariate Gaussian distribution centered around the peak observation o q (34). The covariance matrix is chosen, so the sampling range covers the inaccuracies introduced by the grid spacing d s and is defined as A particle is now defined by a state vector holding position and velocity, set to zero initially, in three dimensions, and a weight q i determining the importance, as named in [55]. The particle weights q i are the normalized acoustic activity map values where c i are the acoustic map values for the particle positions x i following the procedure from equations (28) to (32) without peak deletion. The estimation of peak position is done by taking a weighted mean, or centroid, of the particle positions, see (40).
The velocity values are not required above, but ensure that the trajectories obtained for the peak positions evolve smoothly over time. To ensure trajectories can be adjusted to reasonable physical qualities, the excitation-damping dynamical model adapted from [19] and [22,23] is applied to each particle s i . The prediction step computes the particle state for time instant m from the state of the previous time instant m + 1. For this, the state-space system from [19] with the time step Dt determined by hopsize is denotes as the system matrix ; ð44Þ as well as the process noise with Including the process noise F is necessary to account for the unpredictability in the trajectory of unknown moving objects. The two factors determine the behavior of the particles. The factors a dyn and b dyn are chosen dependent on the expected object movement. They determine the damping of velocity and the process noise influence on the prediction step. In this application, these values were chosen to be a dyn = 2 and b dyn = 0.04 as the active sound objects are mostly static. The particle filter dynamics above describes how the particles move, however as the process noise lead to random directions, the particle swarm still needs to be led towards the locations of the greatest acoustic activity. This achieved by importance re-sampling, which is a step that relocates particles of low importance to locations of particles with high importance; a relocated particle continues its motion from its new location. In this work this re-sampling is applied every time instant m, subsequent to the application of the object detection algorithm, which is subsequent to particle position prediction, cf. Algorithm 1. The re-sampling scheme used in this algorithm is the systematic approach introduced in [56,53]; other methods are multinomial [48,53], residual [56,52], and stratified ( [50], is done by sampling).

Anti-causal computation
Since the probabilistic calculations and certain time thresholds of the algorithm impose a lag on the detection of actively trackable sound objects, anti-causal look-ahead processing is introduced. After the first analysis of the microphone array recordings, a part of the process is repeated in the time-reversed direction.
Each object detection event from the causal process is used as starting point for the anti-causal processing. The probabilistic detection algorithm is applied in the same manner as in the causal process, however without the ability to initialize new track instances. Only the continuation or removal of existing instances as introduced in Section 3.3 is applied. The particle filter prediction step itself is applied using a negative time step size Dt ? ÀDt in (44).
The computation of the prediction step, particle weights, and position estimation is only done for time instants m at which the causal processing has not declared the object as active yet.

Object detection algorithm
Section 3.3 already introduced the concepts of threshold values determining the initialization, continuation, and removal of tracking instances. These values are computed by a probabilistic algorithm introduced in [21,22] that is adapted to our three-dimensional application.
The sequential peak detection algorithm introduced in Section 3.3 yields Q instantaneous observations o q (34) and peak activity values C q (33). These peaks can be caused by objects directly, but are still strongly affected by noise, reflections, discretization artifacts, and other interference. Therefore, each detected peak (observation) is evaluated in terms of its viability for the peak tracking algorithm. To accomplish this, the probability of the observation (detected peak) to existing sound objects needs to be quantified, or the likelihood of it belonging to a new object, or it being a false detection. The local relative peak prominence of each peak observation is the relation between first and therefore maximum peak q = 1 and the remaining peaks q > 1 in contrast to empirically defined calculation in [22] or histogram peak values in [23] that are found in literature.
For the subsequent explanation, we assume an active sound scene of s = 1 . . . S initialized peak tracking instances yielding the position estimatesx s (42). All variables introduced in Section 3.5 now carry the additional index for the tracking instance s. For each sound-object track s, we define the observability as in [22,23], but denoted differently as for the time instant m. The object activity A (m) and the object existence E (m) are probability values for the object peaks that are being actively tracked. The activity is computed using the first-order Markov model with the transition probabilities p A = 0.95 and p" A ¼ 0:05 representing the state transition from active to active state and inactive to active state, respectively, just as in [22]. The intermediate activity valueÃ ðmÞ including the object tracking probability P s . The small value > 0 added to the denominator ensures numerical robustness. Then, the recursive existence is defined in [22,23] as the function that automatically assumes high values of high tracking probability P ðmÀ1Þ s , or maps small values of P ðmÀ1Þ s with smoothed falling slopes in E (m) , with the smoothing constant l, which is set to 0.5.
The paper [22] introduces the following hypotheses: the observation with index q is caused (1) by a tracked sound object with index s = 1 . . . S, denoted as H s , (2) by a false detection caused by interference, denoted as H fa , or (3) by a new sound object, denoted as H new .
The set of observations o q with q = 1 . . . Q has to be mapped to these hypotheses. This results in r = 1 . . . (S + 2) Q mapping combinations (cf. Fig. 7), which have to be evaluated. Assuming conditional independence of the single observations, this is done by computing the association probabilities for each combination r. The term H q is the hypothesis mapped to the observation q for the combination r. The term pðo m ð Þ q jHÞ describes the likelihood of an observation at the position o q while mapped to a certain hypothesis. It is defined as This involves a priori knowledge. The three-dimensional spatial distribution function p fa (o q ) describes the probability of an observation o being a false detection, e.g. volumes blocked by obstacles will have a high probability for false detection. Similarly, p new (o q ) describes the known probability of new sound objects appearing in the scene, e.g. on a stage. Both can be set to a uniform distribution if there is no specific knowledge is available. The function pðo m ð Þ q jx ðmÀ1Þ s Þ is not two dimensional as in [22,23], but defined as a three-dimensional multivariate Gaussian distribution centered at the latest estimatex s and evaluated at the observation position o q . The covariance is estimated by the particle positions of the corresponding particle filter as with a scaling, chosen to be c = 4, here. The second term describes probabilities incorporating the peak prominence (49) and time instant-wise computation of the observability (50). It additionally introduces the factors p fa = 0.8 and p new = 0.2 for tuning the impact of the prominence P q . The subsequent summation for all combinations containing the hypothesis gives the marginal probabilities of observation-hypothesis mappings. This is computed as where the Kronecker delta denotes the selection from the set of (S + 2) Q mappings where the hypothesis H is included (see Fig. 7). The value P ðmÞ q ðH new Þ is required for tracking control in Section 3.4. The object tracking probability of each known sound object is evaluated as and is also employed in tracking control of Section 3.4.

Technical evaluation of object position estimation
The accuracy of sound object detection and positional accuracy was evaluated by simulating sound scene recordings with varying noise interference [47]. The simulated scene was created with the simulation library for MATLAB library MCRoomSim 2 [57]. The room is a 6 m Â 6 m Â 3.5 m box model with default room acoustic settings. For this evaluation, four static sound objects, two male and two female speakers from the EBU SQAM collection [58], are situated in the simulated room at static positions (cf. Tab. 1). Microphone arrays are located at 1 m intervals ranging from 3.5 m to 6.5 m on the x and y coordinate. To evaluate the detection capabilities also covering grids with volumetric extent, vertical layers are positioned between 0.5 m and 2.5 m with 1 m spacing along the z axis. This results in 48 virtual microphone arrays. The simulation models microphone arrays as non-coincident using the array geometry of the Oktava MK4012 [37]. The scenes point of origin is located at the center of the room's floor.
To measure the accuracy of the algorithm, the simulated scene is analyzed at different levels of noise interference. The SNR here is defined as the relation between the loudest microphone signal and uncorrelated white noise added to all microphone signals independently with the same signal energy. The SNR values for this evaluations were SNR 2 {9, 12, 15, 18} dB.

Error measures
Mean Distance Error: The measure is defined as the distance between the ground truth to the nearest sound object. Only 1-to-1 mappings are allowed, therefore if an object is already in use for calculation then the next-nearest will be used if existent. The measure is computed for every sound object j = 1 . . . 4 and averaged over N trial trials and M time frames whenever actively tracked sound objects were found: Case (a) is applied if no nearest sound object could be found that was not assigned yet, while (b) applies whenever the

Results
The mean distance error (MDE) is low with SNR values at 15 dB and 18 dB, as visualized in Figure 8a. The confidence intervals suggest stable values between 2 and 10 centimeters. On the other hand, large confidence intervals at the smaller SNR values suggest unreliable detection and position estimation. The activity error time (AET), see Figure 8b, shows a similar dependency on SNR confirming the SNR requirement for this system at !15 dB.
In summary, the evaluation showed promising accuracy that led us to confide in rendering that might be able to synthesize a spatial auditory image of detected sound objects that closely resembles the one of the recorded scene.

Perceptual evaluation
To evaluate the effectiveness of the proposed procedure of rendering, listening experiments have been conducted. A simple scene consisting of two static active sound objects is analysed. The male speaker (Track Nr. 50) and the piano (Track Nr. 60) of the EBU SQAM collection [58]. Here the time interval [1;8] s (piano) and [3;8] s (speech) are used where start of the speech signal is 2 s delayed behind the piano signal. A second scene is simulated where only one of the objects is present, which in turn is used as the baseline for the rendering using the estimated position data from the two-object scene. The purpose of this approach is to include possible artifacts and inaccuracies stemming from inter-object interference in analysis whilst minimizing complexity for the listeners of the experiments.
The experiment was conducted in two parts. First, static listener perspectives were presented to the listener where the virtual listener is positioned at fixed locations facing the active sound object. The second part presented dynamic perspectives, representing a linear motion of the listener perspective along a pre-defined path. In this case the look direction was varied between four distinct orientations. These pre-defined modes of motion are used for auralization using a binaural decoder, so that the experiment can be conducted using headphones without the need for head/ position tracking equipment.
The methodology of the experiment was a MUSHRAlike [59] comparative evaluation asking for the perceptual similarity to the reference, also described as the authenticity. The conditions of the comparison include the simulated reference, the proposed method, two broadband rendering methods, and an anchor. 3 The reference is the binaural rendering of the listener perspective as simulated by MCRoomSim [57].
The proposed condition is a binaural rendering following the procedure introduced in Section 2 and Section 3.
The VLO approach was introduced in [30,31] and is a broadband spatial rendering method to enable acoustic scene playback with spatially distributed surround recordings. Refer to Figure 3a for an overview. The virtual loudspeaker objects are encoded in third-order Ambisonics and decoded to binaural signals with the IEM Binau-ralDecoder [60].
The Vector-Based Intensity Panning (VBIP) approach is a simple superposition of three surround recordings transformed to first-order Ambisonics signals weighted with the areal coordinate approach. Again, the IEM Binau-ralDecoder 4 was used for decoding to headphone signals. A previous study on the performance of VLO and VBIP as done in [61].
To give a low-rating reference, a mono version of the VBIP condition was added as anchor. This was visually not identifiable and part of the randomly sorted multistimulus presentation.

Experiment
Both the static and dynamic listening tasks used a room measuring 6 m Â 6 m Â 3.5 m simulated in MCRoomSim with the same grid layout as in the technical evaluation above: equally spaced according to Figure 9 and vertically repeated in three layers at 0.5 m, 1.5 m and 2.5 m height. The simulated arrays model the geometry of the Oktava MK4012 [37].
The static-perspective tasks evaluated four listener positions that are visualized in Figure 9. Such a task consisted of a the comparative rating of authenticity where the reference was always visible to the listener and the order of conditions was randomized and not visually identifiable. Each static position was rated twice by each participant, also randomly sequenced within the set of multi-stimulus tasks.
Further the dynamic-perspective tasks consisted of the comparative rating of authenticity with visible reference. The four look directions A, B, C, D shown in Figure 9 were evaluated twice by each participant, in randomized condition and task orders.
The signals were optimized for the most common AKG and Beyerdynamic high end models to minimize coloration. In total, 16 expert listeners aged between 24 and 39 (average age: 29) took part in the listening experiment taking 30 min on average to complete it. The pairwise statistical significance was assessed with a Wilcoxon signed rank test [62] with Bonferroni-Holm correction [63]. There were 32 responses for each condition yielding 4 Â 32 = 128 responses when merging over positions 1 to 4 in part 1 and look directions A-D in part 2. The data proved to be consistent enough to be merged.
All plots in Figure 10 show the sample median of collected sample populations and !95% confidence intervals. The choice of the median over the mean is based on its higher robustness towards outliers, especially relatively small sample populations. The confidence intervals are computed by applying the binomial test [64,65] to the samples of each evaluated condition.

Results
Static-perspective tasks: The single position ratings are visualized in Figure 10a (Median and !95% confidence intervals). Participants rated the proposed approach higher than all the other conditions, with statistical significance (p < 0.001) at positions 1-3. At position 4, however, VLO rendering shows significantly higher rating (p < 0.001) than VLO at 1-3, and no significant difference to the proposed one (p = 0.0541). The comparison of merged VLO data from Positions 1 and 3 on the one hand and 2 and 4 on the other hand (not displayed) exhibits an advantage (p < 0.001) of the direct perspective positions over the interpolated ones. The ratings of the VBIP approach show a decrease with distance, when ratings are compared with locations in Figure 9. The difference is significant when comparing the farthest and closest position (p < 0.001).
Dynamic-perspective tasks: Figure 10c shows, the ratings for all conditions are very similar over look directions A-D suggesting that look direction is of little influence. Moreover, the proposed and VLO methods are consistently rated higher (p < 0.001) than the VBIP condition. Between proposed and VLO, the advantage is not as strong but existent at all look direction A (p = 0.0696), B (p = 0.0218), C (p < 0.001) or D (p = 0.0650). This experiment supports the findings of [61] where the VLO approach performed better than the VBIP approach.
The merged responses across all directions of the dynamic-perspective experiment (Fig. 10d) imply a significant mean difference (p < 0.001) between ratings of proposed and VLO/VBIP, supporting the results of the static-perspective experiment (Fig. 10b).

Discussion
This listening evaluation confirmed the intended improvements in scene resynthesis by comparison with For the static-perspective part, the authenticity, i.e. the similarity to the reference, of the proposed is consistently rated high, and in the majority of the cases higher than the alternatives. As the one exception, the observed drop off at position 4 is most likely due to the high rating of VLO and combined with the limited scale of the ratings. Further, VLO shows ratings seemingly depending on the listener position and the significant increase in rating at position 4 is due to the fact that this position is a direct microphone array position and far away from a source. There, the surround perspective of the microphone position provides accurate reproduction just as with VBIP, however providing a better room impression owed to the rich diversity of VLOs and their directions. By contrast, spatial reproduction with VLO and VBIP suffers at interpolated, less diffuse listener perspectives.
The dynamic-perspective part of the experiment shows an increase in ratings for VLO as the interpolation is perceived smoother and is rated almost as high as proposed.
As residual signals of proposed are based on the VLO approach, smoothness and general impression are understandably similar whenever the virtual listener has moved away from the sources. The advantages lie in auditory localization of direct sounds through the proposed object direct signal extraction and encoding. The data shows this improvement in the single-direction as well as the mergeddata comparisons.

Conclusion
In this contribution we proposed an analysis and resynthesis method for acoustic scenes recorded with distributed surround microphone arrays, in the investigated case tetrahedral A-format Ambisonics microphones. We could show that multi-perspective recordings provide sufficiently much additional information for rendering with significantly improved spatial accuracy and authenticity, already when performed broadband, in the time domain, only. This effectively avoids any risk of introducing musical-noise artifacts that any potentially more effective time-frequency processing intrinsically bears.
A numerical experiment considered sound objects of a simulated scene and could prove good accuracy in object position and signal activity estimation, and it revealed a 15 dB SNR or direct to diffuse ratio limit that local microphones around the active sound object should be able to satisfy. We could further verify the suspected improvement compared to two known broadband perspective interpolation approaches in a two-part listening experiment; its results show improvements in authenticity and spatial definition.