Issue: Acta Acust., Volume 5, 2021
Article Number  20  
Number of page(s)  16  
Section  Virtual Acoustics  
DOI  https://doi.org/10.1051/aacus/2021012  
Published online  30 April 2021 
Scientific Article
Particle-filter tracking of sounds for frequency-independent 3D audio rendering from distributed B-format recordings
Institute of Electronic Music and Acoustics, University of Music and Performing Arts Graz, Inffeldgasse 10/III, 8010 Graz, Austria
^{*} Corresponding author: matthias.blochberger@posteo.at
Received: 2 October 2020
Accepted: 16 March 2021
Six-Degree-of-Freedom (6DoF) audio rendering interactively synthesizes spatial audio signals for a variable listener perspective based on surround recordings taken at multiple perspectives distributed across the listening area in the acoustic scene. Methods that rely on recording-implicit directional information and interpolate the listener perspective without attempting to localize and extract sounds often yield high audio quality, but are limited in spatial definition. Methods that perform sound localization, extraction, and rendering typically operate in the time-frequency domain and risk introducing artifacts such as musical noise. We propose to take advantage of the rich spatial information recorded in the broadband time-domain signals of the multitude of distributed first-order (B-format) recording perspectives. Broadband time-variant signal extraction that retrieves direct signals and leaves residuals to approximate diffuse and spacious sounds poses less of a quality risk, as does the broadband re-encoding that enhances the spatial definition of both signal types. To detect and track direct sound objects in this process, we combine the directional data recorded at the single perspectives into a volumetric multi-perspective activity map for particle-filter tracking. Our technical and perceptual evaluation confirms that this kind of processing enhances the otherwise limited spatial definition of direct-sound objects of other broadband but signal-independent virtual loudspeaker object (VLO) or Vector-Based Intensity Panning (VBIP) interpolation approaches.
Key words: 6DoF rendering / Variable-perspective rendering / Multi-perspective audio
© M. Blochberger and F. Zotter, Published by EDP Sciences, 2021
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1 Introduction
The interactive rendering of recorded auditory scenes as virtual listening environments requires an approach that allows six Degrees of Freedom (6DoF) of movement for a variable listener perspective. The variable-perspective rendering of auditory scenes requires interpolation between static recording perspective positions. In existing research, this concept is often referred to as scene navigation or scene walkthrough. This contribution mainly refers to first-order tetrahedral microphone arrays as a means of recording surround audio for high-fidelity applications.
While volumetrically navigable 6DoF recording and rendering are theoretically feasible, practical distributions of multiple static 3D audio recordings typically consider capturing perspective changes along the horizontal dimensions to enable walkable rendering of the auditory scene.
Perspective extrapolation of a single perspective for a shifted listening position has been considered in the SpaMoS (spatially modified synthesis) method by Pihlajamäki and Pulkki [1, 2], which estimates time-frequency-domain source positions by projecting directional signal detections of DirAC (directional audio coding [3, 4]) onto a predefined convex hull (e.g. the room walls). This method is assumed to be accurate in spatial reproduction when the parallactic shift with regard to the original perspective recording stays small. Similarly, Plinge et al. [5] utilize DirAC in combination with known distance information to rotate and attenuate sources in a single-perspective recording to extrapolate its perspective. This approach would be expandable by multiple directional signals obtained via HARPEX by Barrett and Berge [6] for first-order signals, as Stein and Goodwin suggest in [7]. Higher-order Ambisonics signals are used in a work by Allen and Kleijn [8] that employs a matching-pursuit algorithm for multi-directional signal decomposition, takes into account estimated source distances, and labels reflections and direct sounds. Kentgens et al. [9] apply alternative multi-directional decomposition, e.g., the subspace methods SORTE/MUSIC for extraction and shifted re-encoding of direct components, complemented by a noise and diffuseness subspace that preserves ambient sounds. Birnie et al. [10] introduce a sound field translation method for 6DoF binaural rendering based on sparse plane-wave expansion for near and far sources arranged on two rings around the higher-order recording perspective. Altogether, the single-perspective extrapolation approaches require either parametric time-frequency processing or higher-order microphone arrays to achieve directional definition, and their extrapolation range depends on how successfully distance information is guessed or estimated. Alternatively, Bates and O’Dwyer [11] and Lee et al. [12] employ a more classical spaced array augmented by controllable-directivity microphones to simulate an extrapolated listening perspective.
Multiple perspectives contain additional information needed for explicit acoustic source localization that enlarges the supported range of shifted listening perspectives with high spatial definition. Brutti et al. [13, 14], Hack [15], and Del Galdo et al. [16, 17] introduce object localization methods using maps of the acoustic activity to localize or triangulate the sources within a scene. In [13–15], sequential peak picking algorithms are proposed to avoid erroneous detections in correlation- and intensity-based triangulation, respectively. The virtual microphone method [16, 17] utilizes the detected location of every short-time frequency bin to assemble a virtual microphone signal of an arbitrary, parametrically rendered directivity pattern, which in principle would allow parametric 6DoF rendering. Zheng [18] combines sound object detection with time-frequency-domain signal separation to extract direct signals for sound field navigation. Particle filters to track sound sources detected in reverberant environments were described by Ward et al. [19], and Fallon et al. [20] introduce a particle-filter-based acoustic source tracking algorithm for a time-varying number of sources. Probabilistic detection combined with particle filters for directional tracking of sounds was introduced by Valin et al. [21, 22] for robotic applications and later adapted by Kitić and Guérin [23] to a first-order Ambisonics application.
Multi-perspective recordings also permit perspective interpolation methods in which information about source locations stays implicit. Tylka and Choueiri [24–26] proposed interpolatory spherical-harmonic re-expansion at low frequencies for multi-perspective first- or higher-order Ambisonic recordings. Similar to the simplified treatment of high frequencies in Tylka et al., there are also a number of simplistic broadband signal interpolation methods that work in the time domain only. Such methods avoid any risk of introducing musical noise artifacts, which can occur in time-frequency-domain processing. Mariette et al. [27] mix the first-order Ambisonics signals of the three nearest recording positions in proportion to their proximity, similar to the proposal of Schörkhuber et al. [28], which introduces additional amplitude-dependent gains. Patricio et al. [29] propose a distance-based linear interpolation between higher-order Ambisonic recordings in the time domain, in which the higher-order content of distant microphones is faded out, while proximal recording perspectives remain unaltered. Multiple perspective recordings are mapped to multiple surround playback rings of virtual loudspeaker objects (VLOs) by Grosche et al. [30] and Zotter et al. [31], of which the direction and amplitude (involving a distance and directivity function) vary with the listener position and are employed in higher-order re-encoding of the VLO signals. A good introduction to scene recording and sound field interpolation, especially in multimedia VR applications, is given by RivasMendez et al. [32].
For simplicity, simulations and experiments in this contribution deal with an equidistant grid of recording perspectives volumetrically distributed within a homogeneous auditory scene; however, the approach introduced is more general. We introduce a multi-perspective interpolation method that merges and extends detection/tracking and broadband signal processing concepts found in the literature. A broadband signal extraction and rendering method is utilized for artifact-free signal processing in combination with automatic signal detection and position estimation for higher spatial accuracy. The estimated position of any detected object is used to steer broadband beamformers at the nearest recording positions to capture the object’s direct sound. Weighted and delay-compensated combinations of the extracted signals yield approximated direct signals, while residual signals with direct-sound directions suppressed aim to reintroduce enveloping components of the diffuse sound field. Signal extraction and encoding procedures are described in Section 2 and scene analysis procedures in Section 3. Detection accuracy is technically investigated in Section 4 under varying SNR conditions. To assess the performance and achievable improvement of the proposed algorithm applied to a simple acoustic scene recording with static objects, a two-part listening experiment in Section 5 compares the rendering method with two existing broadband 6DoF rendering methods, for a static and a moving listener.
2 Frequency-independent 6DoF Rendering
Given the listener position and the microphone array positions, and assuming the sound source positions are known, we can compute the signals customized to the acoustic perspective of a single virtual listener, delivered as a stream of higher-order Ambisonics [33] signals. This delivery format offers modular decoders that facilitate playback on headphones or loudspeaker layouts, as well as the Ambisonic sound field rotation operator. In case there are multiple virtual listeners, the listening-position-dependent processing steps are carried out interactively and separately for each listener, excluding the sound-scene analysis steps that are precomputed offline.
2.1 Signal encoding and decoding
The multi-channel Ambisonics signal χ(t) of the listener perspective of the order N is computed by multiplication of a single-channel signal with the weights of an encoder y_{N}(θ). Such an encoder consists of the N3D-normalized spherical harmonics evaluated at the unit-length direction vector θ = [cosφ sinϑ, sinφ sinϑ, cosϑ]. The theoretical background of spherical harmonics and the concept of Ambisonics can be found in the literature, e.g. [33], and practical implementation of encoders and decoders alike is easily accomplished with libraries such as the one introduced by [34].^{1} We will encode the listening-position-dependent object direct signals (cf. Sect. 2.2) and the residual signals (cf. Sect. 2.3), depending on the relative direction vector θ to the listener. For S signals in total, the signals s_{i}(t), their gains g_{i}, and the instantaneous direction θ_{i} = θ_{i}(t) of each signal, encoding is defined as
$$\chi(t) = \sum_{i=1}^{S} g_{i}\, y_{N}(\theta_{i})\, s_{i}(t) \qquad (1)$$
Rotation of the Ambisonics perspective is necessary with headphone playback (Fig. 1). It enables taking into account the listener’s head rotation to simulate a static acoustic scene outside the headphones. This is done by applying a (N + 1)^{2} × (N + 1)^{2} rotation matrix R(φ, ϑ, γ) (cf. [35, 33]) to the Ambisonics signal
Figure 1 Diagram of the rendering algorithm. Position data (cf. Sect. 3) of sound objects is used to render the Ambisonics listener perspective using real-time tracking of the listener position and head rotation. 
To render the headphone signals from the variable-rotation Ambisonics signal, the MagLS binaural decoder is used, as described in [36]. With the exception of this binaural decoder, the proposed signal processing introduces no frequency-dependent processing.
The upcoming sections introduce the methods to extract object direct signals and residual signals from the multiperspective recordings, and they define gain and direction values for the encoding step in (1).
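The encoding step in (1) can be illustrated with a minimal sketch. It is restricted to first order for brevity (the rendering described here uses higher orders), it keeps each direction fixed over the block of samples, and the ACN channel order W, Y, Z, X as well as the function names are assumptions of this example, not taken from the article.

```python
import math

def encode_first_order_n3d(direction):
    """N3D-normalized first-order spherical harmonics at a unit vector (x, y, z).

    The channel order W, Y, Z, X (ACN) is an assumption of this sketch.
    """
    x, y, z = direction
    s3 = math.sqrt(3.0)
    return [1.0, s3 * y, s3 * z, s3 * x]

def encode_scene(signals, gains, directions):
    """Sum gain-weighted, direction-encoded mono signals into four channels."""
    n_samples = len(signals[0])
    chi = [[0.0] * n_samples for _ in range(4)]
    for sig, g, d in zip(signals, gains, directions):
        y = encode_first_order_n3d(d)
        for ch in range(4):
            for t in range(n_samples):
                chi[ch][t] += g * y[ch] * sig[t]
    return chi

# Usage: one source straight ahead (+x) with gain 2 and two samples.
chi = encode_scene([[1.0, 0.5]], [2.0], [(1.0, 0.0, 0.0)])
```

In an interactive renderer the directions and gains would be updated per block or per sample as the listener moves.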
2.2 Object direct signal extraction
The object direct signals are approximations of the signal emitted from audible objects in the scene arriving at the virtual listener. Their positions are assumed to be known here; Section 3 thereafter introduces position estimation.
Figure 2 shows the positions of three microphone arrays, an object position, and the listener position, which are all required to approximate the direct signal of the object as a weighted sum. In general, for each known or estimated sound object with position inside the scene, a simplex of surrounding microphone arrays is selected to define the closest set of surrounding recordings taken. The signals at its vertices are likely to capture the cleanest instances of the object’s direct sound. Initial weights for the signals at these vertices are obtained from the area coordinates, i.e. the barycentric coordinates of a point in the triangle (Fig. 2), which is the typical simplex when recording positions are distributed horizontally, cf. (7), or within a tetrahedron if recordings are distributed volumetrically. But first, the microphone array at any of these positions is a spherical constellation of transducers, and θ_{m,j} denotes the transducer direction vectors. In our case, the index range is j = 1 … 4, as the arrays considered are tetrahedral Oktava MK 4012 [37]. The tetrahedral cardioid microphone signals of each recording perspective are encoded into N3D-normalized first-order Ambisonics by
assuming the j-th recorded transducer signal of the perspective p is denoted as s_{p,j}(t). From this, a signal
is extracted by applying a beamforming vector g that steers towards the object position, here denoted as −θ_{op,i}, which is illustrated in Figure 2. This beamforming vector is modelled as on-axis-normalized maximum-directivity (in first order: hypercardioid) and can be computed as
where y_{1} is the encoder of first order in the negative object-perspective direction −θ_{op,i}.
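As a hedged illustration of such a beamformer: for N3D-normalized first-order signals, dividing the encoder of the steering direction by (N + 1)^2 = 4 gives an on-axis-normalized maximum-directivity pattern, which is the hypercardioid 0.25 + 0.75 cos Θ. The sketch below assumes this standard form and the ACN channel order; the function names are hypothetical, and the steering direction passed in would be −θ_{op,i} in the notation above.

```python
import math

def y1_n3d(direction):
    """First-order N3D encoder, channel order W, Y, Z, X (assumed)."""
    x, y, z = direction
    s3 = math.sqrt(3.0)
    return [1.0, s3 * y, s3 * z, s3 * x]

def max_directivity_weights(steer_direction):
    """On-axis-normalized first-order weights: y_1(steer) / (N + 1)^2, N = 1."""
    return [c / 4.0 for c in y1_n3d(steer_direction)]

def beam_gain(steer_direction, source_direction):
    """Gain the beam applies to a plane wave from source_direction."""
    w = max_directivity_weights(steer_direction)
    y = y1_n3d(source_direction)
    return sum(wi * yi for wi, yi in zip(w, y))

def extract(foa_frame, steer_direction):
    """Apply the beam to one four-channel first-order sample."""
    w = max_directivity_weights(steer_direction)
    return sum(wi * ci for wi, ci in zip(w, foa_frame))
```

The pattern is unity on axis, 0.25 at 90 degrees, and −0.5 towards the rear, so sounds from other directions are attenuated but not removed.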
Figure 2 Three-perspective extraction. 
To combine the three beamforming-extracted signals of an object from the triplet i = 1 … 3 into a single direct signal, a combined gain is defined for the object s and perspective i
Herein, the gain g_{dir,s,i} denotes the areal or barycentric coordinate weight, cf. [38]. Assuming a projected sound object position, it favors the closest perspective from the given triplet of projected positions p_{2D,1}, p_{2D,2}, p_{2D,3} and yields
with
The remaining value is computed by
The factor g_{dir,s,i} quantifies the alignment of the object-perspective directions θ_{op,i} with the object-listener direction θ_{ol,s}. Its task is to favor perspectives that are directionally aligned with the direction of radiation from the source to the listener, and it thereby favors the most suitable surrounding perspectives in terms of source directivity. It is formalized as a cardioid
of the order α. Again, the vectors θ_{op,i} and θ_{ol,s} are unit vectors and are illustrated in Figure 2.
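The two gain factors described above can be sketched as follows. The barycentric part uses the standard area-coordinate formula for a point in a 2D triangle, with the third weight obtained as the remainder to one. The alignment part assumes the usual cardioid form ((1 + cos Θ) / 2)^α; the exact expressions in the article's equations are not reproduced here, and the function names are hypothetical.

```python
def barycentric_weights(p, a, b, c):
    """Area (barycentric) coordinates of the 2D point p in triangle (a, b, c)."""
    det = (b[1] - c[1]) * (a[0] - c[0]) + (c[0] - b[0]) * (a[1] - c[1])
    w1 = ((b[1] - c[1]) * (p[0] - c[0]) + (c[0] - b[0]) * (p[1] - c[1])) / det
    w2 = ((c[1] - a[1]) * (p[0] - c[0]) + (a[0] - c[0]) * (p[1] - c[1])) / det
    return w1, w2, 1.0 - w1 - w2   # the remaining weight completes the sum to one

def alignment_gain(theta_op, theta_ol, alpha=1.0):
    """Cardioid of order alpha between two unit vectors (assumed form)."""
    cos_angle = sum(u * v for u, v in zip(theta_op, theta_ol))
    return ((1.0 + cos_angle) / 2.0) ** alpha

def combined_gains(p, triangle, thetas_op, theta_ol, alpha=1.0):
    """Product of area weight and alignment gain for each of the three perspectives."""
    weights = barycentric_weights(p, *triangle)
    return [w * alignment_gain(t, theta_ol, alpha) for w, t in zip(weights, thetas_op)]
```

An object located exactly at one recording position receives the weight 1 for that perspective and 0 for the other two, and a perspective on the far side of the object from the listener is suppressed by the alignment gain.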
The signal encoding as introduced with (1) requires a direction vector, a signal, and a gain. The object-listener direction θ_{ol,s} is employed to compute the encoder (Sect. 2.1) for the approximated object direct signal, which is the combination of the areal-coordinate gain-weighted (6) and delay-compensated extracted signals (4)
Here, delay compensation is done based on the speed of sound c and distance differences Δt_{s,i} = c^{−1} (d_{ol,s} − d_{op,s,i}), where the distances d_{ol,s} and d_{op,s,i} denote the object-listener and object-perspective distances for the object s and perspective i = 1 … 3 in the triplet, respectively, see Figure 2. To model a realistic distance-dependent amplitude attenuation of the signals within the Ambisonic listener perspective, the areal coordinate gains are first multiplied by the object-perspective to object-listener distance ratio. Then the combination
is the gain that is employed in (1) and depends on the distance between listener and source. It is limited to a maximum of 4 (+12 dB) to avoid excessive boosts whenever d_{ol,s} becomes small.
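A small sketch of the delay and distance-gain computation described above. The speed of sound of 343 m/s and the choice to apply the limit of 4 directly to the distance ratio are assumptions of this example; the article applies the limit to the combined gain, whose exact expression is not reproduced here.

```python
import math

SPEED_OF_SOUND = 343.0   # m/s, assumed value
MAX_GAIN = 4.0           # +12 dB limit stated in the text

def delay_seconds(d_ol, d_op):
    """Delay difference: object-listener minus object-perspective path."""
    return (d_ol - d_op) / SPEED_OF_SOUND

def distance_gain(d_ol, d_op, max_gain=MAX_GAIN):
    """Object-perspective to object-listener distance ratio, limited to max_gain."""
    if d_ol <= 0.0:
        return max_gain          # listener at the object position: use the limit
    return min(d_op / d_ol, max_gain)

# Usage: listener twice as far from the object as the microphone array.
delay = delay_seconds(2.0, 1.0)   # positive: the listener hears the sound later
gain = distance_gain(2.0, 1.0)    # 0.5, i.e. about -6 dB
```

A positive delay means the extracted signal must be delayed further, while a negative one would require advancing it, which in practice calls for a common look-ahead buffer.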
2.3 Object direct signal suppression (residual signals)
In the optimal case, the approximated object direct signals (11) exclude all room information such as early reflections and late reverberation. They provide a clean signal for accurate directional perception, but do not convey a realistic room impression to the listener. To this end, the residual signals are introduced. A similar concept of direct signal suppression, albeit in the higher-order Ambisonics domain, was employed in [9] to extract ambient components.
Here, the concept of a residual signal is implemented in terms of the virtual loudspeaker object (VLO) approach [30, 31] that is illustrated in Figure 3a. Each perspective p_{p} holds a number of microphones of the directions θ_{m,j}. Each direction and microphone signal is represented by a VLO that is positioned at a finite distance R from the perspective, p_{p} + Rθ_{m,j}. The normalized direction vector from the listener position to the virtual loudspeaker objects as well as the corresponding distances are computed as
and
respectively. As described in [30, 31], the VLO gains should depend on the distance
and in direction parametrically (from unity to cardioid)
Figure 3 The VLO method [30, 31] is visualized in (a) while (b) shows the deemphasis term introduced in equation (18). 
Here, the gain (15) attenuates distant VLOs. VLOs that are too close are attenuated by r to maintain a robust result and avoid erroneous localization. The gain (16) ensures that listeners do not hear sound far behind a VLO, which is always on-axis oriented towards p_{p}, but not abruptly so when walking through the VLO position. The direction ϕ_{p,i} in (13) represents the parallactic displacement at a shifted listening position. The VLO approach achieves an enveloping and spatially plausible reproduction when used with multi-perspective microphone arrays distributed in the recorded scene. There is potential for improvement in the spatial definition of its direct-sound imaging.
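The geometric part of the VLO approach, i.e. the direction and distance from the listener to a VLO placed at p_{p} + Rθ_{m,j}, follows directly from the description above and can be sketched as below. The distance- and direction-dependent gains are left out because their exact form is not given in this text; the function name is hypothetical.

```python
import math

def vlo_direction_and_distance(listener, perspective, mic_direction, radius):
    """Unit vector and distance from the listener to the VLO at p_p + R * theta_m."""
    vlo = [p + radius * m for p, m in zip(perspective, mic_direction)]
    diff = [v - l for v, l in zip(vlo, listener)]
    dist = math.sqrt(sum(c * c for c in diff))
    if dist == 0.0:
        raise ValueError("listener coincides with the VLO position")
    return [c / dist for c in diff], dist

# Usage: array at x = 1 m, capsule looking along +x, VLO radius 2 m.
direction, dist = vlo_direction_and_distance(
    (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 0.0, 0.0), 2.0)
```

As the listener moves, the returned direction changes even though the VLO itself stays fixed, which is the parallactic displacement mentioned above.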
For this work, the VLO method is now modified to serve as a residual-signal renderer complementing the objects’ direct sound signals. Here, the method encodes the 4P residual signals to the listener perspective using (13) as directions in (1). It is, however, necessary to exclude the direct signals from the VLO signals so that the spatial accuracy gained by object signal encoding (Sect. 2.2) is not diminished. To this end, deemphasis is applied to each microphone-array perspective with the goal of suppressing direct signals that belong to identified sound objects. This is done by gains depending on the object-perspective directions and the object’s direct signal amplitude, applied to the signals s_{p,i}(t) for all perspectives p = 1 … P and transducers i = 1 … 4. Extending the gains (15) and (16) introduced in [30, 31] by a deemphasis term G_{p}(t, θ_{m,i}) leads to the new residual VLO gains
Figure 3b illustrates the concept of this deemphasis term suppressing the directions towards the objects; the term is the product of S directional gain patterns. Each of them is a mixture of unity and cardioid directivity, oriented such that the cardioid suppresses the object signals. It is defined for the array transducer directions θ_{m,i} as
with the singleobject deemphasis pattern
for which the exponent β controls the width of the directional notch and a_{p,s}(t) ∈ [0,1] controls the depth of the notch, or releases it for distant or quiet sound-object signals. For this purpose, a_{p,s}(t) is defined depending on the object-perspective distance and the moving RMS value of (11) as
with
and g_{dis}(d_{op,p}, R_{de}) from (15) using the reference distance R_{de}. The value of the threshold RMS_{s} for each sound object signal is determined by precomputation given the recordings, or it can be defined manually.
3 Estimation of object positions
Figure 4 provides an overview of the procedure to estimate the sound object positions necessary for the rendering algorithm introduced in the previous sections. Given the frequency-domain microphone array surround signals, the directions of arrival (DOAs) of single-frequency components are estimated and combined into DOA maps. This is similar in concept and application to the DOA histograms in [15, 18] and is explained in Section 3.1. Section 3.2 introduces the method to intersect the directional information and to compute combined values, described as the acoustic activity map. Together with the subsequent sequential peak picking algorithm (Sect. 3.3), these concepts are also discussed in [15, 13, 14]. After selection of the instantaneous set of peaks by a sequential algorithm, these are evaluated in terms of probabilistic measures for sound object emergence, continuous activity, and false detections. Section 3.7 explains the computation of these measures. They inform the decision making regarding the instantiation of particle filters for peak tracking. We use these particle filters, a well-known Monte-Carlo probabilistic estimation technique described application-specifically in Section 3.5, to track the three-dimensional position of the acoustic activity peaks as a time-coherent trajectory, expanding the method introduced in [22, 23].
Figure 4 The procedure of object detection and object position estimation. *I/R signifies the decisions to initialize, continue or remove particle filter instances as introduced in Section 3.4. 
Algorithm 1 The computational steps done at each time instant m.
for m = 1 … M do
(1) Single-perspective DOA maps (Sect. 3.1)
(2) Multi-perspective acoustic activity map (Sect. 3.2)
(3) Peak detection (Sect. 3.3)
(4) Particle filter prediction (Sect. 3.5)
(5) Object detection (Sect. 3.7)
(6) Particle filter initialization/deletion (Sect. 3.4)
(7) Particle filter resampling (Sect. 3.5)
end
3.1 Single-perspective DOA map
Direction-of-arrival estimation is applied frame-wise to the surround microphone array signals. The applied approach is the magnitude sensor response method introduced in [39], which was extended to the smoothed magnitude sensor response in [40]. It is applied in the frequency domain by frequency-wise computation of the covariance matrix Σ^{(k)} of the frequency-domain signal data S^{(k)} for bins k = 1 … K at the time instant m as an average over M time frames
of the microphone array frequency bin magnitudes S^{(k)}. A subsequent eigenvalue decomposition
gives us the possibility to further decompose it into signal and noise subspaces. This is done by selecting L eigenvalues per bin k. As used in [40], this can be a fixed number L or be variable and computed with methods such as SORTE introduced in [41]. The case of a first-order surround signal allows the estimation of a maximum of two directions per frequency bin, as is the case with HARPEX [6]. The eigenvectors, i.e. the columns of the left eigenvector matrix U^{(k)} corresponding to the selected L eigenvalues, span the signal subspace. These vectors give the estimation of the DOAs
for each frequency bin k.
In a similar application, [15] uses histograms accumulating DOA estimates. Here, this concept is evolved into a non-discrete map in the spherical harmonics domain. The eigenvalues and the corresponding DOAs of all frequency bins are aggregated into a broadband single-perspective DOA map that is represented by real-valued spherical harmonics (cf. [33]) of the order N
The frequency- and eigenvalue-dependent function
is used to limit the frequency range and compress large ranges of eigenvalues.
DOA maps according to (26) are continuous and have implicit smoothing that depends on the order, which permits interpolated evaluation at arbitrary directions. This is necessary for the intersection of the single-perspective DOA maps, sampled on a three-dimensional grid.
3.2 Multi-perspective acoustic activity map
The computation of three-dimensional data from the single-perspective DOA maps is based on work in [13–15] and further involves spherical harmonic representation/encoding as in Section 3.1 as well as decoding to interpolate for the subsequent computations. The continuous maps (26) obtained allow the computation of values for any direction and, by extension, any position. Although this multi-perspective fusion of DOA maps into sound object positions implies somewhat omnidirectional sources, at least statistically, the rendering (Sect. 2) takes source directivity into account, as far as it is observable from the sparse recording perspectives. It does so in the position-dependent signal extraction (Sect. 2.2 and (10)) by favoring the perspectives best aligned with the radiation directions to the listener.
We introduce a set of positions s_{i} with i = 1 … G, which can be regarded to be arbitrary for now and will be used in two different ways below. Then we define the unit directions
that sample the directions from each recording perspective position p_{p} to every position s_{i} in the set. The corresponding directions discretize the DOA map of any perspective by evaluating a spherical harmonic interpolation matrix
and hereby discretizing the single-perspective DOA map to display DOA activity for the positions s_{i}
These discrete DOA activities are subsequently weighted by a distance factor to give emphasis to perspectives close to the position s_{i},
This function is applied element-wise to w_{p}, and δ_{d} = 2 is the decay factor chosen here.
Next, the distance-weighted single-perspective values are combined by applying an l-norm
The term w_{p}[i] denotes the i-th element of the vector. This yields one value γ_{i} for each position s_{i}, which will subsequently be called the acoustic activity for the position in question. The acoustic activities {γ_{i}} for an entire set of positions {s_{i}} are called the acoustic activity map below (Fig. 5).
Figure 5 The acoustic activity map evaluated at equidistant grids on horizontal planes for visualization with two visible peaks. 
3.3 Peak detection algorithm
We apply peak detection to the acoustic activity map at every time frame, yielding a set of Q peak observations O = [o_{1} … o_{Q}] with the corresponding acoustic activity peak values Γ_{q} with q = 1 … Q.
The peak detection algorithm is a greedy sequential algorithm similar to [15], but it takes advantage of the continuous spherical harmonic DOA maps. We define the set of nodes s_{i} (cf. Sect. 2) on an equidistant three-dimensional grid with i = 1 … G positions and a grid spacing of d_{s}, here chosen to be 0.25 m. The nodes are used to evaluate the acoustic activity following the procedure of equations (28)–(32). The maximum of these grid values
is selected. As the global maximum of the acoustic activity, it likely corresponds to the loudest sound object in the scene. The position of the peak, the observation o_{q}, is defined as the grid position corresponding to that maximum
To detect local maxima representing other sound objects and at the same time avoid ghost peaks, we apply peak deletion on the DOA maps after each iteration of peak detection. This removal of peaks is similar to approaches [42, 43] for directional loudness editing in Ambisonics or generalized spherical beamforming [44–46]. A particularly suitable suppression function is denoted as
and applied to each single-perspective DOA map. The function zeros the spherical harmonic representation at the direction θ_{0}, which is the perspective-to-peak direction
This removes an on-axis-normalized, order-weighted beam pattern using
that is directed towards the peak to be erased. The order weights a are essentially arbitrary, but max-rE weights [33] of an order slightly smaller than that of the DOA map proved to work well. This deletion step is done before continuing to detect the next loudest peak, as it minimizes the likelihood of an erroneous detection of ghost peaks associated with suboptimal ray intersections belonging to the previous peak, cf. Figure 6a. A more elaborate explanation of the deletion function and order weights can be found in [47]. The sequential peak picking is repeated until either a defined number Q of observations is reached or the peak value falls below a threshold. The step sequence is presented in Algorithm 2.
Figure 6 (a) visualizes the problem of ghost peaks. (b) and (c) show the application of the peak deletion function, which removes directional components from the DOA maps. Panel captions: (a) Intersection of directional information can lead to ghost peaks in three-dimensional data. (b) DOA map before peak deletion, maximum at (−45, −40). (c) Renormalized DOA map after peak deletion, maximum at (45, 0). 
Algorithm 2 Step sequence of the peak picking and deletion algorithm.
while peak magnitude is above threshold and maximum number of observations is not reached do
(1) Compute acoustic activity map (32)
(2) Pick global maximum (33) and grid position (34)
(3) for each perspective do
(a) Compute direction vector (36)
(b) Apply peak deletion (35)
end
end
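The control flow of Algorithm 2 can be sketched as a generic pick-and-delete loop. The map evaluation and the deletion are passed in as callables because the spherical-harmonic DOA maps themselves are not reproduced here. The toy functions used for illustration (a sum of one-dimensional Gaussian bumps, with deletion removing the nearest bump) are hypothetical stand-ins and not the article's activity map or deletion function.

```python
import math

def sequential_peak_picking(state, grid, evaluate, delete, max_peaks, threshold):
    """Greedy pick-and-delete loop.

    evaluate(state, position) -> activity value at a grid position
    delete(state, position)   -> new state with the picked peak suppressed
    """
    peaks = []
    while len(peaks) < max_peaks:
        values = [evaluate(state, s) for s in grid]          # activity map
        best = max(range(len(grid)), key=values.__getitem__)  # global maximum
        if values[best] < threshold:                          # nothing salient left
            break
        peaks.append((grid[best], values[best]))
        state = delete(state, grid[best])                     # suppress before next pick
    return peaks

# Toy stand-in (hypothetical): bumps are (center, amplitude) pairs.
def toy_evaluate(bumps, s):
    return sum(a * math.exp(-((s - c) ** 2) / 0.1) for c, a in bumps)

def toy_delete(bumps, s):
    nearest = min(bumps, key=lambda b: abs(b[0] - s))
    return [b for b in bumps if b is not nearest]

grid = [0.25 * i for i in range(17)]                          # 0 ... 4 m, spacing 0.25 m
peaks = sequential_peak_picking([(1.0, 1.0), (3.0, 0.6)], grid,
                                toy_evaluate, toy_delete, max_peaks=5, threshold=0.1)
```

Deleting each picked peak before searching again is what keeps the weaker second source from being masked by the side lobes of the first.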
3.4 Peak tracking with particle filters
The peak observations O = [o_{1} … o_{Q}] introduced in Section 3.3 are instantaneous for a time instant m. They are most likely noisy and therefore not directly usable for position estimation of the salient sound objects. To compute time-coherent trajectories for this position estimation, particle filters are introduced. Extending the concepts introduced in [21–23], particle filters are used in conjunction with a probabilistic detection algorithm, which evaluates the aforementioned peak observations using transitional probabilities. This evaluation involves a procedure to (a) start tracking a detected peak, (b) continue tracking a peak, and (c) stop tracking a peak. Each peak considered valuable is associated with its own particle filter instance. Details on such an instance are found in Section 3.5. For all observations and known objects, the procedure below is considered, starting at the most prominent peak:
(a) The lifecycle of a tracking instance starts when the value for the peak observation o_{q} exceeds a threshold of 0.7. If this is the case, a particle filter for this peak is initialized at the time instant m.
(b) The continuation of a currently tracked peak is determined by a corresponding value. If it exceeds a threshold of 0.6, the sound object is deemed existent, active, and observable, and a location estimate is computed for the current time instant m. After first creation of an instance according to (a), the threshold must be exceeded for longer than 0.1 s for the estimate to be considered viable, to avoid spurious detection.
(c) Complementing (b): if this value falls below the threshold of 0.6 for ≥0.6 s, the peak is deemed vanished and the instance is discontinued.
The procedure to compute the two values used in (a) and (b) is explained in Section 3.7, but it is useful to first describe the particle filter as applied in this work and its computational caveats.
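The start, continue, and stop rules above amount to a small state machine per tracked peak, sketched below with the stated thresholds and durations. The detection values themselves come from Section 3.7 and are simply passed in here. The class and function names are hypothetical, and the choice that the 0.1 s confirmation time must be reached without interruption is an assumption of this sketch.

```python
START_THRESHOLD = 0.7    # start tracking, rule (a)
ACTIVE_THRESHOLD = 0.6   # continue or stop tracking, rules (b) and (c)
CONFIRM_TIME = 0.1       # seconds above threshold before estimates are viable
RELEASE_TIME = 0.6       # seconds below threshold before the instance is removed

def should_start(value):
    """Rule (a): initialize a particle filter for this peak observation."""
    return value > START_THRESHOLD

class TrackLifecycle:
    """Keeps the timers of one particle filter instance."""

    def __init__(self):
        self.time_above = 0.0
        self.time_below = 0.0
        self.confirmed = False

    def update(self, value, dt):
        """Advance by one frame of duration dt; returns the instance status."""
        if value > ACTIVE_THRESHOLD:
            self.time_above += dt
            self.time_below = 0.0
            if self.time_above > CONFIRM_TIME:
                self.confirmed = True
            return "tracking" if self.confirmed else "waiting"
        self.time_below += dt
        if not self.confirmed:
            self.time_above = 0.0    # assumption: confirmation must be uninterrupted
        if self.time_below >= RELEASE_TIME:
            return "removed"         # rule (c)
        return "waiting"
```

A caller would create a `TrackLifecycle` whenever `should_start` fires, use position estimates only while the status is "tracking", and delete the particle filter once "removed" is returned.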
3.5 Particle filter dynamics
A particle filter is an estimation method employing a swarm of particles to estimate statistical measures. Here, it is used to estimate the continuous, true position of a peak in the activity map from the statistical centroid of the particle-swarm positions. The particle swarm randomly samples the map around the true peak that it is directed to follow, and each particle has its own inertia and momentum. The concept of particle filters has been applied to estimation problems in audio applications such as in [19, 22, 23]. A non-exhaustive list of literature on particle filter theory is [48–55].
Since the peak observations O introduced in Section 3.3 are not time-coherent and their positional accuracy is determined by the grid spacing, the particle filtering approach is used to obtain a consistent and continuous peak position estimation over time.
When a new peak is observed, as described in Section 3.4, an instance of a particle filter is initialized. This is done by sampling N = 100 positions following a multivariate Gaussian distribution
centered around the peak observation o_{q} (34). The covariance matrix is chosen so that the sampling range covers the inaccuracies introduced by the grid spacing d_{s}, and it is defined as
A particle is now defined by a state vector
holding position and velocity (the latter initially set to zero) in three dimensions, and by a weight q_{i} determining its importance, as it is called in [55]. The particle weights q_{i} are the normalized acoustic activity map values
where γ_{i} are the acoustic map values for the particle positions x_{i} following the procedure from equations (28) to (32) without peak deletion.
The estimation of peak position is done by taking a weighted mean, or centroid,
of the particle positions, see (40).
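The initialization, weighting, and centroid steps described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the isotropic covariance with standard deviation `grid_spacing / 2` is an assumption standing in for the covariance that the paper defines in terms of d_{s}. The activity-map values γ_{i} are assumed to be supplied by the caller.

```python
import numpy as np

N_PARTICLES = 100  # N from the text


def init_particles(peak, grid_spacing, rng):
    """Draw particle states [x, y, z, vx, vy, vz] around a peak observation.

    Assumption: isotropic covariance with standard deviation grid_spacing / 2.
    """
    cov = (grid_spacing / 2.0) ** 2 * np.eye(3)
    mean = np.asarray(peak, dtype=float)
    positions = rng.multivariate_normal(mean, cov, size=N_PARTICLES)
    velocities = np.zeros_like(positions)  # velocity is zero initially
    return np.hstack([positions, velocities])


def normalize_weights(map_values):
    """Turn non-negative activity-map values gamma_i into weights summing to one."""
    gamma = np.asarray(map_values, dtype=float)
    total = gamma.sum()
    if total <= 0.0:
        # Fall back to uniform weights if the map is empty at all particles.
        return np.full(gamma.shape, 1.0 / gamma.size)
    return gamma / total


def estimate_position(states, weights):
    """Weighted mean (centroid) of the particle positions."""
    return weights @ states[:, :3]
```

With uniform weights, `estimate_position` reduces to the plain mean of the particle positions, which is a quick sanity check for the weighting.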
The velocity values are not required above, but ensure that the trajectories obtained for the peak positions evolve smoothly over time. To ensure trajectories can be adjusted to reasonable physical qualities, the excitationdamping dynamical model adapted from [19] and [22, 23] is applied to each particle s_{i}. The prediction step
computes the particle state for time instant m from the state of the previous time instant m − 1. For this, the state-space system from [19] is used, with the time step Δt determined by the hop size; it is described by the system matrix
as well as the process noise
with
Including the process noise F is necessary to account for the unpredictability in the trajectory of unknown moving objects. The two factors
determine the behavior of the particles. The factors α_{dyn} and β_{dyn} are chosen dependent on the expected object movement. They determine the damping of velocity and the process noise influence on the prediction step. In this application, these values were chosen to be α_{dyn} = 2 and β_{dyn} = 0.04 as the active sound objects are mostly static.
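A prediction step of this kind can be sketched as below. The exact system matrix and noise scaling of the paper are not reproduced here; the sketch assumes the excitation-damping form commonly given in the cited tracking literature, with a = exp(−α_{dyn} Δt) and b = β_{dyn} sqrt(1 − a²), and it assumes that the position update uses the new velocity. The function name `predict` and the state layout are hypothetical.

```python
import numpy as np

ALPHA_DYN = 2.0   # velocity damping factor from the text
BETA_DYN = 0.04   # process-noise factor from the text


def predict(states, dt, rng, alpha=ALPHA_DYN, beta=BETA_DYN):
    """One excitation-damping step for states [x, y, z, vx, vy, vz], dt > 0.

    Assumed model:
        v_new = a * v_old + b * noise,  a = exp(-alpha * dt),
                                        b = beta * sqrt(1 - a**2)
        x_new = x_old + dt * v_new
    """
    a = np.exp(-alpha * dt)
    b = beta * np.sqrt(1.0 - a ** 2)
    new = np.array(states, dtype=float)
    noise = rng.standard_normal((new.shape[0], 3))  # unit-variance excitation
    new[:, 3:] = a * new[:, 3:] + b * noise
    new[:, :3] = new[:, :3] + dt * new[:, 3:]
    return new
```

Setting `beta=0.0` removes the excitation, so the velocities simply decay by the factor a per step; this makes the role of the two factors easy to inspect in isolation.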
The particle filter dynamics above describe how the particles move. However, as the process noise leads to random directions, the particle swarm still needs to be led towards the locations of the greatest acoustic activity. This is achieved by importance resampling, a step that relocates particles of low importance to the locations of particles with high importance; a relocated particle continues its motion from its new location. In this work, this resampling is applied at every time instant m, subsequent to the application of the object detection algorithm, which is subsequent to the particle position prediction, cf. Algorithm 1. The resampling scheme used in this algorithm is the systematic approach introduced in [53, 56]; other methods are multinomial [48, 53], residual [52, 56], and stratified [50] resampling.
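Systematic resampling in its textbook form can be sketched as follows. This illustrates the general technique and is not taken from the authors' implementation; the function name is hypothetical.

```python
import numpy as np


def systematic_resample(weights, rng):
    """Return particle indices drawn by systematic resampling.

    A single uniform offset u in [0, 1) defines N evenly spaced pointers
    (u + k) / N into the cumulative weight distribution.
    """
    w = np.asarray(weights, dtype=float)
    n = w.size
    pointers = (rng.random() + np.arange(n)) / n
    cumulative = np.cumsum(w / w.sum())
    cumulative[-1] = 1.0  # guard against round-off at the upper end
    return np.searchsorted(cumulative, pointers, side="right")
```

The returned indices would be used to copy particle states, for example `states = states[idx]`, after which all weights are typically reset to 1/N. Particles with zero weight are never selected, and with uniform weights every particle is kept exactly once.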
3.6 Anticausal computation
Since the probabilistic calculations and certain time thresholds of the algorithm impose a lag on the detection of actively trackable sound objects, anticausal lookahead processing is introduced. After the first analysis of the microphone array recordings, a part of the process is repeated in the timereversed direction.
Each object detection event from the causal process is used as the starting point for the anticausal processing. The probabilistic detection algorithm is applied in the same manner as in the causal process, however without the ability to initialize new track instances. Only the continuation or removal of existing instances as introduced in Section 3.4 is applied. The particle filter prediction step itself is applied using a negative time step size Δt → −Δt in (44).
The computation of the prediction step, particle weights, and position estimation is only done for time instants m at which the causal processing has not declared the object as active yet.
3.7 Object detection algorithm
Section 3.4 already introduced the concept of threshold values determining the initialization, continuation, and removal of tracking instances. The values compared against these thresholds are computed by a probabilistic algorithm introduced in [21, 22] that is adapted to our three-dimensional application.
The sequential peak detection algorithm introduced in Section 3.3 yields Q instantaneous observations o_{q} (34) and peak activity values Γ_{q} (33). These peaks can be caused by objects directly, but are still strongly affected by noise, reflections, discretization artifacts, and other interference. Therefore, each detected peak (observation) is evaluated in terms of its viability for the peak tracking algorithm. To accomplish this, the probability that the observation (detected peak) belongs to an existing sound object needs to be quantified, as well as the likelihood of it belonging to a new object or being a false detection. The local relative peak prominence of each peak observation is the relation between the first, and therefore maximum, peak q = 1 and the remaining peaks q > 1
in contrast to the empirically defined calculation in [22] or the histogram peak values in [23] found in the literature.
For the subsequent explanation, we assume an active sound scene of s = 1 … S initialized peak tracking instances yielding the position estimates (42). All variables introduced in Section 3.5 now carry the additional index for the tracking instance s. For each soundobject track s, we define the observability as in [22, 23], but denoted differently as
for the time instant m. The object activity A^{(m)} and the object existence E^{(m)} are probability values for the object peaks that are being actively tracked. The activity is computed using the firstorder Markov model
with the transition probabilities p_{A} = 0.95 and representing the state transition from active to active state and inactive to active state, respectively, just as in [22]. The intermediate activity value in (51) is the probabilistic combination
including the object tracking probability P_{s}. The small value ϵ > 0 added to the denominator ensures numerical robustness.
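The two building blocks named above, a first-order Markov prediction and a probabilistic combination with a small ϵ in the denominator, can be sketched generically as below. This is an illustration under assumptions and does not reproduce the paper's equations: the function names are hypothetical, the order in which the two steps are chained is not specified here, and the inactive-to-active transition probability must be supplied by the caller.

```python
def markov_step(p_active_prev, p_stay_active, p_become_active):
    """First-order Markov prediction of the activity probability.

    p_stay_active:   transition probability active -> active (0.95 in the text)
    p_become_active: transition probability inactive -> active (caller-supplied)
    """
    return p_stay_active * p_active_prev + p_become_active * (1.0 - p_active_prev)


def combine(p_a, p_b, eps=1e-12):
    """Combine two probabilities for the same binary event.

    Assumed form: p_a * p_b / (p_a * p_b + (1 - p_a) * (1 - p_b) + eps),
    where eps > 0 keeps the denominator away from zero.
    """
    numerator = p_a * p_b
    return numerator / (numerator + (1.0 - p_a) * (1.0 - p_b) + eps)
```

For example, `markov_step(prev, 0.95, 0.05)` uses the active-to-active probability from the text together with a purely hypothetical inactive-to-active probability of 0.05. Combining any probability with 0.5 leaves it essentially unchanged, which is the expected behavior of an uninformative second input.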
Then, the recursive existence is defined in [22, 23] as the function
that automatically assumes high values for a high tracking probability, and maps small values of the tracking probability to smoothly falling slopes in E^{(m)}; the smoothing constant μ is set to 0.5.
The paper [22] introduces the following hypotheses: the observation with index q is caused (1) by a tracked sound object with index s = 1 … S, (2) by a false detection caused by interference (subscript fa below), or (3) by a new sound object (subscript new below).
The set of observations o_{q} with q = 1 … Q has to be mapped to these hypotheses. This results in r = 1 … (S + 2)^{Q} mapping combinations (cf. Fig. 7), which have to be evaluated. Assuming conditional independence of the single observations, this is done by computing the association probabilities
for each combination r. The term is the hypothesis mapped to the observation q for the combination r. The term describes the likelihood of an observation at the position o_{q} while mapped to a certain hypothesis. It is defined as
Figure 7 (a) Graphical visualization of possible mapping combinations; (b) with Q = 2 and S = 1, a total number of nine possible combinations exist. The selection introduced with δ_{r}, selects all r where the hypothesis is included, i.e. δ_{r,fa} would lead to the set of r ∈ {1, 2, 3, 4, 7}. 
This involves a priori knowledge. The three-dimensional spatial distribution function p_{fa}(o_{q}) describes the probability of an observation o being a false detection, e.g. volumes blocked by obstacles will have a high probability for false detection. Similarly, p_{new}(o_{q}) describes the known probability of new sound objects appearing in the scene, e.g. on a stage. Both can be set to a uniform distribution if no specific knowledge is available. The function is not two-dimensional as in [22, 23], but defined as a three-dimensional multivariate Gaussian distribution
centered at the latest estimate and evaluated at the observation position o_{q}. The covariance is estimated by the particle positions of the corresponding particle filter as
with a scaling factor, here chosen to be c = 4.
The second term
describes probabilities incorporating the peak prominence (49) and time instantwise computation of the observability (50). It additionally introduces the factors p_{fa} = 0.8 and p_{new} = 0.2 for tuning the impact of the prominence P_{q}.
The subsequent summation for all combinations containing the hypothesis gives the marginal probabilities of observationhypothesis mappings. This is computed as
where the Kronecker delta denotes the selection from the set of (S + 2)^{Q} mappings where the hypothesis is included (see Fig. 7). The value is required for tracking control in Section 3.4. The object tracking probability of each known sound object is evaluated as
and is also employed in tracking control of Section 3.4.
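The combinatorial part of this section, mapping Q observations to S + 2 hypotheses and selecting the mappings that include a given hypothesis, can be sketched as follows. This only illustrates the enumeration; the association probabilities themselves are not computed. The function names and the labels "fa" and "new" are hypothetical, as is the choice to list the false-detection hypothesis first.

```python
from itertools import product


def enumerate_mappings(num_observations, num_tracks):
    """All (S + 2)**Q assignments of Q observations to the S + 2 hypotheses.

    Hypotheses: "fa" (false detection), "new" (new object), and the
    track indices 0 .. S-1. The ordering is an assumption of this sketch.
    """
    hypotheses = ["fa", "new"] + list(range(num_tracks))
    return list(product(hypotheses, repeat=num_observations))


def select_including(mappings, hypothesis):
    """1-based indices r of all mappings in which the hypothesis occurs."""
    return [r for r, mapping in enumerate(mappings, start=1) if hypothesis in mapping]


def marginal(mappings, probabilities, hypothesis):
    """Sum of normalized mapping probabilities over the selected mappings."""
    total = sum(probabilities)
    selected = set(select_including(mappings, hypothesis))
    return sum(p for r, p in enumerate(probabilities, start=1) if r in selected) / total
```

For Q = 2 and S = 1, `enumerate_mappings(2, 1)` yields nine combinations, and with the ordering assumed here `select_including(..., "fa")` returns the indices 1, 2, 3, 4, and 7, which agrees with the set quoted in the caption of Figure 7. The exponential growth of (S + 2)^{Q} is also the reason why Q and S need to stay small in practice.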
4 Technical evaluation of object position estimation
The accuracy of sound object detection and position estimation was evaluated by simulating sound scene recordings with varying noise interference [47]. The simulated scene was created with the MATLAB simulation library MCRoomSim^{2} [57]. The room is a 6 m × 6 m × 3.5 m box model with default room acoustic settings. For this evaluation, four static sound objects, two male and two female speakers from the EBU SQAM collection [58], are situated in the simulated room at static positions (cf. Tab. 1). Microphone arrays are located at 1 m intervals ranging from 3.5 m to 6.5 m on the x and y coordinate. To evaluate the detection capabilities also covering grids with volumetric extent, vertical layers are positioned between 0.5 m and 2.5 m with 1 m spacing along the z axis. This results in 48 virtual microphone arrays. The simulation models microphone arrays as non-coincident using the array geometry of the Oktava MK4012 [37]. The scene’s point of origin is located at the center of the room’s floor.
Table 1 The sound object positions and EBU SQAM collection track numbers. The first 5 s of the signals are used in the simulation.
To measure the accuracy of the algorithm, the simulated scene is analyzed at different levels of noise interference. The SNR is defined here as the ratio between the loudest microphone signal and the uncorrelated white noise, which is added to all microphone signals independently with the same signal energy. The SNR values for this evaluation were SNR ∈ {9, 12, 15, 18} dB.
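The noise condition described above can be sketched as follows. This is a minimal illustration under assumptions: "loudest" is interpreted as the channel with the highest mean power, and the function name and the array layout (channels × samples) are hypothetical.

```python
import numpy as np


def add_noise(signals, snr_db, rng):
    """Add independent white noise of equal power to every channel.

    signals: array of shape (channels, samples). The noise power is set
    relative to the channel with the highest mean power (assumption).
    """
    signals = np.asarray(signals, dtype=float)
    reference_power = np.max(np.mean(signals ** 2, axis=1))
    noise_power = reference_power / (10.0 ** (snr_db / 10.0))
    # Independent Gaussian noise per channel, same variance everywhere.
    noise = rng.standard_normal(signals.shape) * np.sqrt(noise_power)
    return signals + noise
```

Because the noise level is tied to the loudest channel, all quieter channels end up with an effective SNR below the nominal value, which makes the stated SNR an upper bound per array channel.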
4.1 Error measures
Mean Distance Error: The measure is defined as the distance between the ground-truth position and the nearest estimated sound object. Only 1-to-1 mappings are allowed; therefore, if an estimated object is already in use for the calculation, the next-nearest one is used if it exists. The measure is computed for every sound object j = 1 … 4 and averaged over N_{trial} trials and M time frames whenever actively tracked sound objects were found:
Activity Error Time: The activity error time is the time during which false positives or false negatives are observed in the sound object activity. It is computed by summing the lengths of the time frames in which no nearest sound object exists or the sound object probability differs from 1. It is a strict measure, for which the resulting error time can be high despite accurate results. Here it is used to show the relative improvement with varying noise interference. It is defined as
Case (a) is applied if no nearest sound object could be found that was not assigned yet, while (b) applies whenever the tracked sound object s is nearest to the jth truth value still available. The true value is defined by manual labeling based on the instantaneous magnitude of the sound object.
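The 1-to-1 nearest-object assignment behind the mean distance error can be sketched as below. This is an illustration under assumptions: the ground-truth objects are processed greedily in list order, which the text does not specify, and the function names and data layout (one pair of position lists per time frame) are hypothetical.

```python
import math


def match_one_to_one(truth_positions, estimated_positions):
    """Greedy 1-to-1 assignment of ground-truth objects to free estimates.

    Returns one distance per ground-truth object, or None when no
    unassigned estimate is left for that object.
    """
    free = list(range(len(estimated_positions)))
    distances = []
    for truth in truth_positions:
        if not free:
            distances.append(None)
            continue
        nearest = min(free, key=lambda k: math.dist(truth, estimated_positions[k]))
        distances.append(math.dist(truth, estimated_positions[nearest]))
        free.remove(nearest)  # each estimate may be used only once
    return distances


def mean_distance_error(frames, num_objects):
    """Per-object mean of the matched distances over all frames.

    frames: iterable of (truth_positions, estimated_positions) pairs.
    """
    sums = [0.0] * num_objects
    counts = [0] * num_objects
    for truth, estimates in frames:
        for j, d in enumerate(match_one_to_one(truth, estimates)):
            if d is not None:
                sums[j] += d
                counts[j] += 1
    return [s / c if c else float("nan") for s, c in zip(sums, counts)]
```

Frames in which an object has no free estimate contribute nothing to that object's mean distance; in the evaluation above such frames would instead count towards the activity error time.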
4.2 Results
The mean distance error (MDE) is low with SNR values at 15 dB and 18 dB, as visualized in Figure 8a. The confidence intervals suggest stable values between 2 and 10 centimeters. On the other hand, large confidence intervals at the smaller SNR values suggest unreliable detection and position estimation. The activity error time (AET), see Figure 8b, shows a similar dependency on SNR confirming the SNR requirement for this system at ≥15 dB.
Figure 8 (a) Higher SNR values in scene recordings yield good positional accuracy in object localization between 2 and 10 cm. (b) The AET of the sound objects decreases with higher SNR. At lower SNRs, the confidence intervals suggest again a strong variation in results indicative of unstable measurement results. Shown are mean, and 95% confidence intervals. 
In summary, the evaluation showed promising accuracy, which gave us confidence that rendering based on these estimates may be able to synthesize a spatial auditory image of the detected sound objects that closely resembles the one of the recorded scene.
5 Perceptual evaluation
To evaluate the effectiveness of the proposed rendering procedure, listening experiments were conducted. A simple scene consisting of two static active sound objects is analysed: the male speaker (Track Nr. 50) and the piano (Track Nr. 60) of the EBU SQAM collection [58]. The time intervals [1;8] s (piano) and [3;8] s (speech) are used, so that the start of the speech signal is delayed by 2 s relative to the piano signal. A second scene is simulated where only one of the objects is present, which in turn is used as the baseline for the rendering using the estimated position data from the two-object scene. The purpose of this approach is to include possible artifacts and inaccuracies stemming from inter-object interference in the analysis whilst minimizing complexity for the listeners of the experiments.
The experiment was conducted in two parts. First, static listener perspectives were presented to the listener where the virtual listener is positioned at fixed locations facing the active sound object. The second part presented dynamic perspectives, representing a linear motion of the listener perspective along a predefined path. In this case the look direction was varied between four distinct orientations. These predefined modes of motion are used for auralization using a binaural decoder, so that the experiment can be conducted using headphones without the need for head/position tracking equipment.
The methodology of the experiment was a MUSHRAlike [59] comparative evaluation asking for the perceptual similarity to the reference, also described as the authenticity. The conditions of the comparison include the simulated reference, the proposed method, two broadband rendering methods, and an anchor.^{3}
The reference is the binaural rendering of the listener perspective as simulated by MCRoomSim [57].
The proposed condition is a binaural rendering following the procedure introduced in Section 2 and Section 3.
The VLO approach was introduced in [30, 31] and is a broadband spatial rendering method to enable acoustic scene playback with spatially distributed surround recordings. Refer to Figure 3a for an overview. The virtual loudspeaker objects are encoded in thirdorder Ambisonics and decoded to binaural signals with the IEM BinauralDecoder [60].
The Vector-Based Intensity Panning (VBIP) approach is a simple superposition of three surround recordings transformed to first-order Ambisonics signals weighted with the areal coordinate approach. Again, the IEM BinauralDecoder^{4} was used for decoding to headphone signals. A previous study on the performance of VLO and VBIP was done in [61].
To give a lowrating reference, a mono version of the VBIP condition was added as anchor. This was visually not identifiable and part of the randomly sorted multistimulus presentation.
5.1 Experiment
Both the static and dynamic listening tasks used a room measuring 6 m × 6 m × 3.5 m simulated in MCRoomSim with the same grid layout as in the technical evaluation above: equally spaced according to Figure 9 and vertically repeated in three layers at 0.5 m, 1.5 m and 2.5 m height. The simulated arrays model the geometry of the Oktava MK4012 [37].
Figure 9 The perspectives used in the listening experiment. (a) The static perspective positions and the trajectory of the dynamic perspective used in the listening experiment. (b) Lists coordinates of static perspective positions and path start and end point. 
The static-perspective tasks evaluated four listener positions that are visualized in Figure 9. Such a task consisted of the comparative rating of authenticity where the reference was always visible to the listener and the order of conditions was randomized and not visually identifiable. Each static position was rated twice by each participant, also randomly sequenced within the set of multi-stimulus tasks.
Further, the dynamic-perspective tasks consisted of the comparative rating of authenticity with visible reference. The four look directions A, B, C, D shown in Figure 9 were evaluated twice by each participant, in randomized condition and task orders.
The signals were optimized for the most common AKG and Beyerdynamic high-end models to minimize coloration. In total, 16 expert listeners aged between 24 and 39 (average age: 29) took part in the listening experiment, taking 30 min on average to complete it. The pairwise statistical significance was assessed with a Wilcoxon signed-rank test [62] with Bonferroni-Holm correction [63]. There were 32 responses for each condition, yielding 4 × 32 = 128 responses when merging over positions 1 to 4 in part 1 and look directions A–D in part 2. The data proved to be consistent enough to be merged.
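The Bonferroni-Holm step-down correction named above can be sketched in its standard form as follows. The function name is hypothetical, and the significance level of 0.05 is an assumption, since the text does not state the level used. The p-values would come from the pairwise Wilcoxon signed-rank tests.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected.

    The p-values are visited in ascending order and compared against
    alpha / (m - rank); the procedure stops at the first failure.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # all larger p-values are retained as well
    return reject
```

Compared with a plain Bonferroni correction, the step-down scheme relaxes the threshold after each rejection, so it is never less powerful while still controlling the family-wise error rate.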
All plots in Figure 10 show the sample median of the collected sample populations and ≥95% confidence intervals. The choice of the median over the mean is based on its higher robustness towards outliers, especially for relatively small sample populations. The confidence intervals are computed by applying the binomial test [64, 65] to the samples of each evaluated condition.
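One common way to obtain a distribution-free confidence interval for the median from the binomial distribution uses order statistics, as sketched below. Whether the authors computed their intervals exactly this way is an assumption; the function name is hypothetical. Because the binomial distribution is discrete, the achieved coverage is generally larger than the requested 95%, which is consistent with the "≥95%" notation.

```python
from math import comb


def median_confidence_interval(samples, confidence=0.95):
    """Interval [x_(k), x_(n-k+1)] of order statistics with coverage >= confidence.

    The coverage of this interval is 1 - 2 * P(B <= k - 1) for
    B ~ Binomial(n, 1/2); the largest admissible k is chosen.
    Returns (lower, upper, achieved_coverage).
    """
    x = sorted(samples)
    n = len(x)
    cumulative = 0.0  # running value of P(B <= j)
    k = 0
    for j in range(n // 2):
        cumulative += comb(n, j) / 2 ** n
        if 1.0 - 2.0 * cumulative >= confidence:
            k = j + 1
        else:
            break
    if k == 0:
        raise ValueError("sample too small for the requested confidence")
    coverage = 1.0 - 2.0 * sum(comb(n, j) for j in range(k)) / 2 ** n
    return x[k - 1], x[n - k], coverage
```

For very small samples no order-statistic interval reaches the requested coverage, in which case the sketch raises an error instead of returning a misleading interval.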
Figure 10 The results of the second experiment comparing dynamic perspectives for (a) single look direction data and conditions as well as the (b) mergedlookdirection data. Pictured are the sample median, the ≥95% confidence intervals and notable statistical significance is marked. 
5.2 Results
Static-perspective tasks: The single-position ratings are visualized in Figure 10a (median and ≥95% confidence intervals). Participants rated the proposed approach higher than all the other conditions, with statistical significance (p < 0.001) at positions 1–3. At position 4, however, VLO rendering shows a significantly higher rating (p < 0.001) than VLO at positions 1–3, and no significant difference to the proposed method (p = 0.0541). The comparison of merged VLO data from positions 1 and 3 on the one hand and 2 and 4 on the other hand (not displayed) exhibits an advantage (p < 0.001) of the direct perspective positions over the interpolated ones. The ratings of the VBIP approach show a decrease with distance when ratings are compared with the locations in Figure 9. The difference is significant when comparing the farthest and closest position (p < 0.001).
Dynamic-perspective tasks: As Figure 10c shows, the ratings for all conditions are very similar over look directions A–D, suggesting that the look direction has little influence. Moreover, the proposed and VLO methods are consistently rated higher (p < 0.001) than the VBIP condition. Between proposed and VLO, the advantage is not as strong but present for all look directions: A (p = 0.0696), B (p = 0.0218), C (p < 0.001), and D (p = 0.0650). This experiment supports the findings of [61], where the VLO approach performed better than the VBIP approach.
The merged responses across all directions of the dynamicperspective experiment (Fig. 10d) imply a significant mean difference (p < 0.001) between ratings of proposed and VLO/VBIP, supporting the results of the staticperspective experiment (Fig. 10b).
5.3 Discussion
This listening evaluation confirmed the intended improvements in scene resynthesis by comparison with established broadband methods. Despite the limited sample size, most of the statistical results were significant. The static- and dynamic-perspective parts of the experiment indicated a significant increase in authenticity of the proposed method over the compared ones.
For the static-perspective part, the authenticity, i.e. the similarity to the reference, of the proposed method is consistently rated high, and in the majority of the cases higher than the alternatives. As the one exception, the observed drop-off at position 4 is most likely due to the high rating of VLO combined with the limited scale of the ratings. Further, VLO shows ratings that seemingly depend on the listener position, and the significant increase in rating at position 4 is due to the fact that this position is a direct microphone array position and far away from a source. There, the surround perspective of the microphone position provides accurate reproduction just as with VBIP, however with a better room impression owed to the rich diversity of VLOs and their directions. By contrast, spatial reproduction with VLO and VBIP suffers at interpolated, less diffuse listener perspectives.
The dynamic-perspective part of the experiment shows an increase in ratings for VLO, as the interpolation is perceived as smoother, and VLO is rated almost as high as the proposed method. As the residual signals of the proposed method are based on the VLO approach, smoothness and general impression are understandably similar whenever the virtual listener has moved away from the sources. The advantages lie in the auditory localization of direct sounds through the proposed object direct-signal extraction and encoding. The data shows this improvement in the single-direction as well as the merged-data comparisons.
6 Conclusion
In this contribution, we proposed an analysis and resynthesis method for acoustic scenes recorded with distributed surround microphone arrays, in the investigated case tetrahedral A-format Ambisonics microphones. We could show that multi-perspective recordings provide sufficient additional information for rendering with significantly improved spatial accuracy and authenticity, even when performed broadband in the time domain only. This effectively avoids the risk of introducing musical-noise artifacts that any potentially more effective time-frequency processing intrinsically bears.
A numerical experiment considered sound objects of a simulated scene and could demonstrate good accuracy in object position and signal activity estimation, and it revealed a 15 dB SNR or direct-to-diffuse ratio limit that local microphones around the active sound object should be able to satisfy. We could further verify the suspected improvement compared to two known broadband perspective interpolation approaches in a two-part listening experiment; its results show improvements in authenticity and spatial definition.
^{1} MATLAB implementation SphericalHarmonicTransform by Archontis Politis available at https://github.com/polarch/SphericalHarmonicTransform.
^{2} Library available at https://github.com/Andronicus1000/MCRoomSim.
^{3} Conditions available at https://phaidra.kug.ac.at/o:107325.
^{4} Plugin available at https://plugins.iem.at/.
References
 T. Pihlajamäki, V. Pulkki: Projecting simulated or recorded spatial sound onto 3dsurfaces, in AES Conference: 45th International Conference: Applications of TimeFrequency Processing in Audio, 03 2012. Available: http://www.aes.org/elib/browse.cfm?elib=16198. [Google Scholar]
 T. Pihlajamäki, V. Pulkki: Synthesis of complex sound scenes with transformation of recorded spatial sound in virtual reality. Journal of the Audio Engineering Society 63 (2015) 542–551. Available: http://www.aes.org/elib/browse.cfm?elib=17840. [Google Scholar]
 V. Pulkki: Directional audio coding in spatial sound reproduction and stereo upmixing, in AES Conference: 28th International Conference: The Future of Audio Technology – Surround and Beyond, 06, 2006. Available: http://www.aes.org/elib/browse.cfm?elib=13847. [Google Scholar]
 V. Pulkki, A. Politis, M.V. Laitinen, J. Vilkamo, J. Ahonen: Firstorder directional audio coding (dirac). Parametric TimeFrequency Domain Spatial Audio 10 (2017) 89–140. https://doi.org/10.1002/9781119252634.ch5. [Google Scholar]
 A. Plinge, S.J. Schlecht, O. Thiergart, T. Robotham, O. Rummukainen, E.A.P. Habets: Sixdegreesoffreedom binaural audio reproduction of firstorder ambisonics with distance information, in: AES International Conference on Audio for Virtual and Augmented Reality 08 2018). Available: http://www.aes.org/elib/browse.cfm?elib=19684. [Google Scholar]
 N. Barrett, S. Berge: A new method for bformat to binaural transcoding, in Audio Engineering Society Conference: 40th International Conference: Spatial Audio: Sense the Sound of Space, 10, 2010. Available: http://www.aes.org/elib/browse.cfm?elib=15527. [Google Scholar]
 E. Stein, M.M. Goodwin: Ambisonics depth extensions for six degrees of freedom, in AES Conference: 2019 AES International Conference on Headphone Technology, 08 2019, Available: http://www.aes.org/elib/browse.cfm?elib=20514. [Google Scholar]
 A. Allen, B. Kleijn: Ambisonic soundfield navigation using directional decomposition and path distance estimation, in ICSA, Graz, Austria, 09 2017. [Google Scholar]
 M. Kentgens, A. Behler, P. Jax: Translation of a higher order ambisonics sound scene based on parametric decomposition, in IEEE ICASSP (2020) 151–155. https://doi.org/10.1109/ICASSP40776.2020.9054414. [Google Scholar]
 L. Birnie, T. Abhayapala, P. Samarasinghe, V. Tourbabin: Sound field translation methods for binaural reproduction, in IEEE WASPAA (2019) 140–144. Available: https://doi.org/10.1109/WASPAA.2019.8937274. [Google Scholar]
 E. Bates, H. O’Dwyer, K.P. Flachsbarth, F.M. Boland: A recording technique for 6 degrees of freedom VR, in AES Convention, Vol. 144. Audio Engineering Society, 05 2018. Available: http://www.aes.org/elib/browse.cfm?elib=19418 [Google Scholar]
 H. Lee: A new multichannel microphone technique for effective perspective control, in AES Convention, Vol. 140. Audio Engineering Society, 05 2011. Available: https://www.aes.org/elib/browse.cfm?elib=15804. [Google Scholar]
 A. Brutti, M. Omologo, P. Svaizer: Localization of multiple speakers based on a two step acoustic map analysis. IEEE ICASSP (2008) 4349–4352. Available: https://doi.org/10.1109/ICASSP.2008.4518618. [Google Scholar]
 A. Brutti, M. Omologo, P. Svaizer: Multiple source localization based on acoustic map deemphasis. EURASIP Journal on Audio, Speech, and Music Processing 2010 (2010). 147495. https://doi.org/10.1155/2010/147495. [Google Scholar]
 P. Hack, Multiple source localization with distributed tetrahedral microphone arrays. Master’s Thesis, Institute of Electronic Music and Acoustics, University of Music and Performing Arts Graz, Graz, Austria, 2015. Available: http://phaidra.kug.ac.at/o:12797 [Google Scholar]
 G. Del Galdo, O. Thiergart, T. Weller, E.A. Habets: Generating virtual microphone signals using geometrical information gathered by distributed arrays, in 2011 Joint Workshop on Handsfree Speech Communication and Microphone Arrays, IEEE, 05 2011. Available: https://doi.org/10.1109. [Google Scholar]
 O. Thiergart, G. Del Galdo, M. Taseska, E.A.P. Habets: Geometrybased spatial sound acquisition using distributed microphone arrays. IEEE Transactions on Audio, Speech, and Language Processing 21 (2013) 2583–2594. https://doi.org/10.1109/TASL.2013.2280210. [Google Scholar]
 X. Zheng: Soundfield navigation: Separation, compressionand transmission. Ph.D. Dissertation, University of Wollongong, 2013. Available: https://ro.uow.edu.au/theses/3943/. [Google Scholar]
 D.B. Ward, E.A. Lehmann, R.C. Williamson: Particle filtering algorithms for tracking an acoustic source in a reverberant environment. IEEE Transactions on Speech and Audio Processing 11 (2003) 11 https://doi.org/10.1109/TSA.2003.818112. [Google Scholar]
 M.F. Fallon, S.J. Godsill: Acoustic source localization and tracking of a timevarying number of speakers. IEEE Transactions on Audio, Speech, and Language Processing 20 (2012) 1409–1415. https://doi.org/10.1109/TASL.2011.2178402. [Google Scholar]
 J.M. Valin, F. Michaud, J. Rouat: Robust 3D localization and tracking of sound sources using beamforming and particle filtering. IEEE ICASSP (2006). https://doi.org/10.1109/ICASSP.2006.1661100. [Google Scholar]
 J.M. Valin, F. Michaud, J. Rouat: Robust localization and tracking of simultaneous moving sound sources using beamforming and particle filtering. Elsevier Science 55 (2007) 216–228. Available: https://arxiv.org/pdf/1602.08139.pdf. [Google Scholar]
 S. Kitić, A. Guérin: Tramp: Tracking by a realtime ambisonicbased particle filter, in LOCATA Challenge Workshop, 09 2018. Available: https://arxiv.org/abs/1810.04080. [Google Scholar]
 J.G. Tylka, E. Choueiri. Soundfield navigation using an array of higherorder ambisonics microphones, in AES International Conference on Audio for Virtual and Augmented Reality, 09 (2016). Available: http://www.aes.org/elib/browse.cfm?elib=18502. [Google Scholar]
 J.G. Tylka, E.Y. Choueiri: Domains of practical applicability for parametric interpolation methods for virtual sound field navigation. Journal of the Audio Engineering Society 67 (2019) 882–893. Available: http://www.aes.org/elib/browse.cfm?elib=20702. [Google Scholar]
 J.G. Tylka: Virtual navigation of ambisonicsencoded sound fields containing nearfield sources. PhD dissertation, Princeton University, 2019. Available: http://arks.princeton.edu/ark:/88435/dsp011544br958. [Google Scholar]
 N. Mariette, B.F.G. Katz, K. Boussetta, O. Guillerminet: Sounddelta: A study of audio augmented reality using wifidistributed ambisonic cell rendering. AES Convention, Vol. 128. Audio Engineering Society, 2010. Available: http://www.aes.org/elib/browse.cfm?elib=15420. [Google Scholar]
 C. Schörkhuber, R. Höldrich, F. Zotter: Triplet-based variable-perspective (6DoF) audio rendering from simultaneous surround recordings taken at multiple perspectives, in Fortschritte der Akustik (DAGA), Hannover, Germany, 04 2020. Available: https://pub.degaakustik.de/DAGA_2020/data/articles/000295.pdf.
 E. Patricio, A. Rumiński, A. Kuklasiński, L. Januszkiewicz, T. Żernicki: Toward six degrees of freedom audio recording and playback using multiple ambisonics sound fields. AES Convention, Vol. 146, Audio Engineering Society, 2019. Available: http://www.aes.org/elib/browse.cfm?elib=20274.
 P. Grosche, F. Zotter, C. Schörkhuber, M. Frank, R. Höldrich: Method and apparatus for acoustic scene playback. Patent WO2018077379A1 (2018). Available: https://patents.google.com/patent/WO2018077379A1.
 F. Zotter, M. Frank, C. Schörkhuber, R. Höldrich: Signal-independent approach to variable-perspective (6DoF) audio rendering from simultaneous surround recordings taken at multiple perspectives, in Fortschritte der Akustik (DAGA), Hannover, Germany, 04 2020. Available: https://pub.degaakustik.de/DAGA_2020/data/articles/000458.pdf.
 D. Rivas Méndez, C. Armstrong, J. Stubbs, M. Stiles, G. Kearney: Practical recording techniques for music production with six-degrees of freedom virtual reality. AES Convention, Vol. 145, Audio Engineering Society, 2015. Available: http://www.aes.org/elib/browse.cfm?elib=19729.
 F. Zotter, M. Frank: Ambisonics, 1st edn., Vol. 19 of Springer Topics in Signal Processing, Springer International Publishing, 2019. https://doi.org/10.1007/9783030172077.
 A. Politis: Microphone array processing for parametric spatial audio techniques. PhD dissertation, Aalto University, 2016. Available: http://urn.fi/URN:ISBN:9789526070377.
 J. Ivanic, K. Ruedenberg: Rotation matrices for real spherical harmonics. Direct determination by recursion. The Journal of Physical Chemistry 100 (1996) 6342–6347. https://doi.org/10.1021/jp953350u.
 C. Schörkhuber, M. Zaunschirm, R. Höldrich: Binaural rendering of ambisonic signals via magnitude least squares, in Fortschritte der Akustik (DAGA), Munich, Germany, 03 2018. Available: https://pub.degaakustik.de/DAGA_2018/data/articles/000301.pdf.
 Oktava GmbH: Oktava mk4012 (2019). Available: http://www.oktavashop.com/images/product_images/popup_images/4012.jpg.
 E. Hille: Analytic Function Theory, 2nd edn., Vol. 1. Chelsea Publishing Company, New York, 1982.
 A. Politis, S. Delikaris-Manias, V. Pulkki: Direction-of-arrival and diffuseness estimation above spatial aliasing for symmetrical directional microphone arrays. IEEE ICASSP (2015) 6–10. https://doi.org/10.1109/ICASSP.2015.7177921.
 T. Wilding: System parameter estimation of acoustic scenes using first order microphones, Master’s thesis. Institute of Electronic Music and Acoustics, University of Music and Performing Arts Graz, Graz, Austria, 2016. Available: http://phaidra.kug.ac.at/o:40685.
 Z. He, A. Cichocki, S. Xie, K. Choi: Detecting the number of clusters in n-way probabilistic clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (2010) 2006–2021. https://doi.org/10.1109/TPAMI.2010.15.
 M. Kronlachner: Spatial transformations for the alteration of ambisonic recordings. Master’s thesis (2014). Available: http://phaidra.kug.ac.at/o:8569.
 M. Hafsati, N. Epain, J. Daniel: Editing ambisonics sound scenes. ICSA, Graz, Austria, 09 2017.
 M. Jeffet, B. Rafaely: Study of a generalized spherical array beamformer with adjustable binaural reproduction (2014) 77–81. https://doi.org/10.1109/HSCMA.2014.6843255.
 N. Shabtai, B. Rafaely: Generalized spherical array beamforming for binaural speech reproduction. IEEE/ACM Transactions on Audio, Speech, and Language Processing 22 (2014) 238–247. https://doi.org/10.1109/TASLP.2013.2290499.
 M. Jeffet, N. Shabtai, B. Rafaely: Theory and perceptual evaluation of the binaural reproduction and beamforming tradeoff in the generalized spherical array beamformer. IEEE/ACM Transactions on Audio, Speech, and Language Processing 24 (2016) 708–718. https://doi.org/10.1109/TASLP.2016.2522649.
 M. Blochberger: Multiperspective scene analysis from tetrahedral microphone recordings. Master’s thesis (2020). Available: https://phaidra.kug.ac.at/o:104549.
 B. Efron, R. Tibshirani: An Introduction to the Bootstrap, Chapman & Hall/CRC Monographs on Statistics & Applied Probability. Taylor & Francis, 1994. Available: https://books.google.at/books?id=gLlpIUxRntoC.
 J.S. Liu, R. Chen: Blind deconvolution via sequential imputations. Journal of the American Statistical Association 90 (1995) 567–576. https://doi.org/10.1080/01621459.1995.10476549.
 P. Fearnhead: Sequential Monte Carlo methods in filter theory. PhD dissertation, University of Oxford, 1998.
 G. Kitagawa: Monte Carlo filter and smoother for non-Gaussian nonlinear state space models. Journal of Computational and Graphical Statistics 5 (1996) 1–25. https://doi.org/10.1080/10618600.1996.10474692.
 J. Liu, R. Chen: Sequential Monte Carlo methods for dynamic systems. Journal of the American Statistical Association 93 (1998) 1032–1044. https://doi.org/10.1080/01621459.1998.10473765.
 J. Carpenter, P. Clifford, P. Fearnhead: An improved particle filter for nonlinear problems. IEE Proceedings Radar Sonar and Navigation 146 (1999) 2–7. https://doi.org/10.1049/iprsn:19990255.
 A. Doucet, N. de Freitas, N. Gordon: Sequential Monte Carlo Methods in Practice, 1st edn., Information Science and Statistics. Springer-Verlag, New York, 2001. https://doi.org/10.1007/9781475734379.
 S. Särkkä: Bayesian Filtering and Smoothing, Institute of Mathematical Statistics Textbooks. Cambridge University Press, 2013. https://doi.org/10.1017/CBO9781139344203.
 D. Whitley: A genetic algorithm tutorial. Statistics and Computing 4 (1994) 65–85. https://doi.org/10.1007/BF00175354.
 A. Wabnitz, N. Epain, C. Jin, A. van Schaik: Room acoustics simulation for multichannel microphone arrays. ISRA, Melbourne, Australia, 08 2010. Available: https://www.acoustics.asn.au/conference_proceedings/ICA2010/cdromISRA2010/Papers/P5d.pdf.
 EBU: Sound Quality Assessment Material recordings for subjective tests, 2008. Available: https://tech.ebu.ch/publications/sqamcd.
 ITU, ITU-R BS.1534-3: Method for the subjective assessment of intermediate quality level of audio systems, 2015. Available: https://www.itu.int/rec/RRECBS.15343201510I.
 D. Rudrich: IEM Plugin Suite. IEM, 2019. Available: https://plugins.iem.at/.
 D. Rudrich, F. Zotter, M. Frank: Evaluation of interactive localization in virtual acoustic scenes. Fortschritte der Akustik (DAGA), Kiel, Germany, 09 2017. Available: https://pub.degaakustik.de/DAGA_2017/data/articles/000182.pdf.
 F. Wilcoxon: Individual comparisons by ranking methods. Biometrics Bulletin 1 (1945) 80–83. Available: http://www.jstor.org/stable/3001968.
 S. Holm: A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6 (1979) 65–70. Available: http://www.jstor.org/stable/4615733.
 D. Altman, D. Machin, T. Bryant, M. Gardner: Statistics with Confidence, Confidence Intervals and Statistical Guidelines, 2nd edn., BMJ Books (2000).
 M. Eid, M. Gollwitzer, M. Schmitt: Statistik und Forschungsmethoden, 5th edn. Julius Beltz, 2017.
Cite this article as: Blochberger M & Zotter F. 2021. Particle-filter tracking of sounds for frequency-independent 3D audio rendering from distributed B-format recordings. Acta Acustica, 5, 20.
All Tables
The sound object positions and EBU SQAM collection track numbers. The first 5 s of the signals are used in the simulation.
All Figures
Figure 1 Diagram of the rendering algorithm. Position data (cf. Sect. 3) of sound objects is used to render the Ambisonics listener perspective using real-time tracking of the listener position and head rotation.
Figure 2 Three perspective extraction. 
Figure 3 The VLO method [30, 31] is visualized in (a), while (b) shows the de-emphasis term introduced in equation (18).
Figure 4 The procedure of object detection and object position estimation. *I/R signifies the decisions to initialize, continue or remove particle filter instances as introduced in Section 3.4. 
Figure 5 The acoustic activity map evaluated at equidistant grids on horizontal planes for visualization with two visible peaks. 
Figure 6 (a) visualizes the problem of ghost peaks. (b) and (c) show the application of the peak deletion function, which removes directional components from the DOA maps. (a) Intersection of directional information can lead to ghost peaks in three-dimensional data. (b) DOA map before peak deletion. Maximum at (−45, −40). (c) Renormalized DOA map after peak deletion. Maximum at (45, 0).
Figure 7 (a) Graphical visualization of possible mapping combinations; (b) with Q = 2 and S = 1, a total of nine possible combinations exists. The selection introduced with δ_{r} selects all r in which the hypothesis is included; e.g., δ_{r,fa} leads to the set r ∈ {1, 2, 3, 4, 7}.
Figure 8 (a) Higher SNR values in scene recordings yield good positional accuracy in object localization between 2 and 10 cm. (b) The AET of the sound objects decreases with higher SNR. At lower SNRs, the confidence intervals suggest again a strong variation in results indicative of unstable measurement results. Shown are mean, and 95% confidence intervals. 
Figure 9 The perspectives used in the listening experiment. (a) The static perspective positions and the trajectory of the dynamic perspective used in the listening experiment. (b) Lists coordinates of static perspective positions and path start and end point. 
Figure 10 The results of the second experiment comparing dynamic perspectives for (a) single-look-direction data and conditions and (b) merged-look-direction data. Shown are the sample medians and the ≥95% confidence intervals; notable statistical significance is marked.