Acta Acustica, Volume 5 (2021), Article Number 20, 16 pages — Section: Virtual Acoustics
DOI: https://doi.org/10.1051/aacus/2021012 — Published online 30 April 2021

© M. Blochberger and F. Zotter, Published by EDP Sciences, 2021

Open Access: This article is distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1 Introduction

The interactive rendering of recorded auditory scenes as virtual listening environments requires an approach that allows six degrees of freedom (6DoF) of movement for a variable listener perspective. The variable-perspective rendering of auditory scenes requires interpolation between static recording perspective positions. In existing research, this concept is often referred to as scene navigation or scene walk-through. This contribution mainly refers to first-order tetrahedral microphone arrays as a means of recording surround audio for high-fidelity applications.

While volumetrically navigable 6DoF recording and rendering are theoretically feasible, practical distributions of multiple static 3D audio recordings typically consider capturing perspective changes along the horizontal dimensions to enable walkable rendering of the auditory scene.

Perspective extrapolation of a single perspective for a shifted listening position has been considered in the SpaMoS (spatially modified synthesis) method by Pihlajamäki and Pulkki [1, 2], which estimates time-frequency-domain source positions by projecting directional signal detections of DirAC (directional audio coding [3, 4]) onto a pre-defined convex hull (e.g. the room walls). This method is assumed to be accurate in spatial reproduction as long as the parallactic shift with regard to the original perspective recording stays small. Similarly, Plinge et al. [5] utilize DirAC in combination with known distance information to rotate and attenuate sources in a single-perspective recording to extrapolate its perspective. This approach would be expandable by multiple directional signals obtained via HARPEX by Barrett and Berge [6] for first-order signals, as Stein and Goodwin suggest in [7]. Higher-order Ambisonics signals are used in a work by Allen and Kleijn [8] that employs a matching-pursuit algorithm for multi-directional signal decomposition, takes into account estimated source distances, and labels reflections and direct sounds. Kentgens et al. [9] apply alternative multi-directional decompositions, e.g., the subspace methods SORTE/MUSIC, for extraction and shifted re-encoding of direct components, complemented by a noise and diffuseness subspace that preserves ambient sounds. Birnie et al. [10] introduce a sound field translation method for 6DoF binaural rendering based on a sparse plane-wave expansion for near and far sources arranged on two rings around the higher-order recording perspective. Altogether, the single-perspective extrapolation approaches require either parametric time-frequency processing or higher-order microphone arrays to achieve directional definition, and their extrapolation range depends on how successfully distance information is guessed or estimated. Alternatively, Bates and O’Dwyer [11] and Lee et al. [12] employ a more classical, spaced array augmented by controllable-directivity microphones to simulate an extrapolated listening perspective.

Multiple perspectives contain additional information needed for explicit acoustic source localization, which enlarges the supported range of shifted listening perspectives with high spatial definition. Brutti et al. [13, 14], Hack [15], and Del Galdo et al. [16, 17] introduce object localization methods using maps of the acoustic activity to localize or triangulate the sources within a scene. In [13–15], sequential peak picking algorithms are proposed to avoid erroneous detections in correlation- and intensity-based triangulation, respectively. The virtual microphone method [16, 17] utilizes the detected location in every short-time frequency bin to assemble a virtual microphone signal of an arbitrary, parametrically rendered directivity pattern, which in principle would allow parametric 6DoF rendering. Zheng [18] combines sound object detection with time-frequency-domain signal separation to extract direct signals for sound field navigation. Particle filters to track sound sources detected in reverberant environments were described by Ward et al. [19], and Fallon et al. [20] introduce a particle-filter-based acoustic source tracking algorithm for a time-varying number of sources. Probabilistic detection combined with particle filters for directional tracking of sounds was introduced by Valin et al. [21, 22] for robotic applications, and was later adapted by Kitić and Guérin [23] to a first-order Ambisonics application.

Multi-perspective recordings also permit perspective interpolation methods in which information about source locations stays implicit. Tylka and Choueiri [24–26] proposed interpolatory spherical-harmonic re-expansion at low frequencies for multi-perspective first- or higher-order Ambisonic recordings. Similar to the simplified treatment of high frequencies by Tylka et al., there is also a number of simple broadband signal interpolation methods working in the time domain only. Such methods avoid any risk of introducing musical-noise artifacts, which can happen in time-frequency-domain processing. Mariette et al. [27] mix the first-order Ambisonics signals of the three nearest recording positions proportionally to their proximity, similar to the proposal of Schörkhuber et al. [28] that introduces additional amplitude-dependent gains. Patricio et al. [29] propose a distance-based linear interpolation between higher-order Ambisonic recordings in the time domain, in which the higher-order content of distant microphones is faded out, while proximal recording perspectives remain unaltered. Multiple perspective recordings are mapped to multiple surround playback rings of virtual loudspeaker objects (VLOs) by Grosche et al. [30] and Zotter et al. [31], of which the direction and amplitude (involving a distance and directivity function) vary with the listener position and are employed in higher-order re-encoding of the VLO signals. A good introduction to scene recording and sound field interpolation, especially in multimedia VR applications, is given by Rivas-Mendez et al. [32].

For simplicity, simulations and experiments in this contribution deal with an equidistant grid of recording perspectives volumetrically distributed within a homogeneous auditory scene; however, the approach introduced is more general. We introduce a multi-perspective interpolation method that merges and extends detection/tracking and broadband signal processing concepts found in the literature. A broadband signal extraction and rendering method is utilized for artifact-free signal processing in combination with automatic signal detection and position estimation for higher spatial accuracy. The estimated position of any detected object is used to steer broadband beamformers at the nearest recording positions to capture the object’s direct sound. Weighted and delay-compensated combinations of the extracted signals yield approximated direct signals, while residual signals with direct-sound directions suppressed aim to reintroduce the enveloping components of the diffuse sound field. Signal extraction and encoding procedures are described in Section 2 and scene analysis procedures in Section 3. Detection accuracy is technically investigated in Section 4 under varying SNR conditions. To assess the performance and achievable improvement of the proposed algorithm applied to a simple acoustic scene recording with static objects, a two-part listening experiment compares the rendering method with two existing broadband 6DoF rendering methods in Section 5, for a static and a moving listener.

2 Frequency-independent 6DoF Rendering

Given the listener position, the microphone array positions, and assuming the sound source positions are known, we can compute the signals customized to the acoustic perspective of a single virtual listener, delivered as a stream of higher-order Ambisonics [33] signals. This delivery format features the benefits of modular decoders facilitating playback on headphones or loudspeaker layouts, and of the Ambisonic sound field rotation operator. In case there are multiple virtual listeners, the listening-position-dependent processing steps are carried out interactively and separately for each listener, excluding the sound-scene analysis steps that are pre-computed offline.

2.1 Signal encoding and decoding

The multi-channel Ambisonics signal χ(t) of the listener perspective of order N is computed by multiplying a single-channel signal with the weights of an encoder $\mathbf{y}_N(\boldsymbol{\theta})$. Such an encoder consists of the N3D-normalized spherical harmonics $\mathbf{y}_N(\boldsymbol{\theta})^{\mathrm{T}}=[Y_0^0(\boldsymbol{\theta}),\,Y_1^{-1}(\boldsymbol{\theta}),\,\dots,\,Y_1^1(\boldsymbol{\theta}),\,\dots,\,Y_N^N(\boldsymbol{\theta})]$ evaluated at the unit-length direction vector θ = [cos φ sin ϑ, sin φ sin ϑ, cos ϑ]. The theoretical background of spherical harmonics and the concept of Ambisonics can be found in the literature, e.g. [33], and practical implementation of encoders and decoders alike is easily accomplished with libraries such as the one introduced in [34].1 We will encode the listening-position-dependent object direct signals (cf. Sect. 2.2) and the residual signals (cf. Sect. 2.3), depending on the relative direction vector θ to the listener. For S signals in total, with signals si(t), gains gi, and instantaneous directions θi = θi(t), the encoding is defined as

$$ \boldsymbol{\chi}(t)=\sum_{i=1}^S \mathbf{y}_N(\boldsymbol{\theta}_i)\, s_i(t)\, g_i. $$(1)
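To make the encoding step (1) concrete, the following minimal Python/NumPy sketch implements it for first order (N = 1) only, assuming ACN channel ordering with N3D normalization, in which the first-order spherical harmonics of a unit direction (x, y, z) reduce to [1, √3·y, √3·z, √3·x]. All function and variable names are illustrative and not taken from the paper.

```python
import numpy as np

def sh_first_order_n3d(theta):
    """Real first-order spherical harmonics (ACN/N3D) for a unit vector theta = (x, y, z)."""
    x, y, z = theta
    return np.array([1.0, np.sqrt(3.0) * y, np.sqrt(3.0) * z, np.sqrt(3.0) * x])

def encode_first_order(signals, directions, gains):
    """Eq. (1) at first order: sum of encoded, gain-weighted mono signals.

    signals:    list of S mono signals, each of length T
    directions: list of S unit direction vectors theta_i relative to the listener
    gains:      list of S scalar gains g_i
    returns:    (4, T) first-order Ambisonics signal chi(t)
    """
    chi = np.zeros((4, len(signals[0])))
    for s, theta, g in zip(signals, directions, gains):
        chi += np.outer(sh_first_order_n3d(theta), np.asarray(s) * g)
    return chi

# Example: a noise burst encoded from the front-left at unit gain
rng = np.random.default_rng(0)
sig = rng.standard_normal(48000)
theta = np.array([1.0, 1.0, 0.0]) / np.sqrt(2.0)
chi = encode_first_order([sig], [theta], [1.0])
print(chi.shape)  # (4, 48000)
```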

Rotation of the Ambisonics perspective is necessary for headphone playback (Fig. 1). It enables taking the listener’s head rotation into account to simulate a static acoustic scene outside the headphones. This is done by applying an $(N+1)^2 \times (N+1)^2$ rotation matrix $\mathbf{R}(\varphi,\vartheta,\gamma)$ (cf. [35, 33]) to the Ambisonics signal

$$ \tilde{\boldsymbol{\chi}}(t)=\boldsymbol{\chi}(t)\,\mathbf{R}(\varphi,\vartheta,\gamma). $$(2)

Figure 1. Diagram of the rendering algorithm. Position data (cf. Sect. 3) of sound objects is used to render the Ambisonics listener perspective using real-time tracking of the listener position and head rotation.

To render the headphone signals from the variable-rotation Ambisonics signal $\tilde{\boldsymbol{\chi}}$, the MagLS binaural decoder is used, as described in [36]. Apart from this binaural decoder, the proposed method introduces no frequency-dependent signal processing.

The upcoming sections introduce the methods to extract object direct signals and residual signals from the multi-perspective recordings, and they define gain and direction values for the encoding step in (1).

2.2 Object direct signal extraction

The object direct signals are approximations of the signal emitted from audible objects in the scene arriving at the virtual listener. Their positions are assumed to be known here; Section 3 thereafter introduces position estimation.

Figure 2 shows the positions of three microphone arrays, an object position, and the listener position, which are all required to approximate the direct signal of the object as a weighted sum. In general, for each known or estimated sound object position inside the scene, a simplex of surrounding microphone arrays is selected to define the closest set of recordings taken. The signals at its vertices are likely to capture the cleanest instances of the object’s direct sound. Initial weights for the signals at these vertices are obtained from the area coordinates, i.e., the barycentric coordinates of a point in the triangle (Fig. 2) as the typical simplex when recording positions are distributed horizontally, cf. (7), or within a tetrahedron if recordings were distributed volumetrically. But first, the microphone array at any of these positions is a spherical constellation of transducers, and θm,j denotes the transducer direction vectors. In our case, the index range is j = 1 … 4, as the arrays considered are tetrahedral Oktava MK4012 arrays [37]. The tetrahedral cardioid microphone signals of each recording perspective are encoded into N3D-normalized first-order Ambisonics by

$$ \boldsymbol{\psi}(t)=\mathrm{diag}\{[1,\ \sqrt{3},\ \sqrt{3},\ \sqrt{3}]\}\sum_{j=1}^4 \mathbf{y}_1\left(\boldsymbol{\theta}_{\mathrm{m},j}\right) s_{p,j}(t), $$(3)

where the jth recorded transducer signal of perspective p is denoted as sp,j(t). From this, a signal

$$ s_{\mathrm{obj}}(t)\approx \mathbf{g}^{\mathrm{T}}\boldsymbol{\psi}(t), $$(4)

is extracted by applying a beamforming vector g that steers towards the object position, i.e., into the direction denoted as −θop,i and illustrated in Figure 2. This beamforming vector is modelled as an on-axis-normalized maximum-directivity pattern (in first order: a hypercardioid) and can be computed as

$$ \mathbf{g}=\frac{\mathbf{y}_1\left(-\boldsymbol{\theta}_{\mathrm{op},i}\right)}{\mathbf{y}_1^{\mathrm{T}}\left(-\boldsymbol{\theta}_{\mathrm{op},i}\right)\mathbf{y}_1\left(-\boldsymbol{\theta}_{\mathrm{op},i}\right)}=\frac{\mathbf{y}_1\left(-\boldsymbol{\theta}_{\mathrm{op},i}\right)}{||\mathbf{y}_1\left(-\boldsymbol{\theta}_{\mathrm{op},i}\right)||^2}, $$(5)

where y1 is the encoder of first order in the negative object-perspective direction −θop,i.
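A sketch of the extraction in (3)–(5), using the same first-order ACN/N3D conventions as the encoding sketch above; the tetrahedral transducer directions `array_dirs` are a generic regular tetrahedron assumed for illustration, not the exact Oktava geometry.

```python
import numpy as np

SQRT3 = np.sqrt(3.0)

def sh1_n3d(theta):
    """Real first-order spherical harmonics (ACN/N3D) of a unit vector."""
    x, y, z = theta
    return np.array([1.0, SQRT3 * y, SQRT3 * z, SQRT3 * x])

def encode_tetra_to_foa(transducer_signals, array_dirs):
    """Eq. (3): encode the four cardioid signals of one perspective into first-order Ambisonics."""
    psi = np.zeros((4, transducer_signals.shape[1]))
    for j in range(4):
        psi += np.outer(sh1_n3d(array_dirs[j]), transducer_signals[j])
    return np.diag([1.0, SQRT3, SQRT3, SQRT3]) @ psi

def hypercardioid_extract(psi, theta_to_object):
    """Eqs. (4)-(5): on-axis-normalized maximum-directivity beam towards the object.

    theta_to_object corresponds to -theta_op,i in the paper's notation."""
    y = sh1_n3d(theta_to_object)
    g = y / np.dot(y, y)          # on-axis normalization, eq. (5)
    return g @ psi                # s_obj(t) ~ g^T psi(t), eq. (4)

# Usage with an assumed regular tetrahedron of transducer directions:
array_dirs = np.array([[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]]) / SQRT3
sig = np.random.default_rng(1).standard_normal((4, 1024))
psi = encode_tetra_to_foa(sig, array_dirs)
s_obj = hypercardioid_extract(psi, np.array([0.0, 1.0, 0.0]))
print(s_obj.shape)  # (1024,)
```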

Figure 2. Three-perspective extraction.

To combine the three beamforming-extracted signals of an object from the triplet i = 1 … 3 into a single direct signal, a combined gain is defined for the object s and perspective i

$$ g_{s,i}=\sqrt{\frac{g_{\mathrm{tri},s,i}\, g_{\mathrm{dir},s,i}}{\sum_{j=1}^3 g_{\mathrm{tri},s,j}\, g_{\mathrm{dir},s,j}}}\qquad\mathrm{with}\ i=1\dots 3. $$(6)

Herein, the gain gtri,s,i denotes the areal or barycentric coordinate weight, cf. [38]. Assuming a projected sound object position $\widehat{\mathbf{x}}_{2\mathrm{D},s}$, it favors the closest perspective from the given triplet of projected positions p2D,1, p2D,2, p2D,3 and yields

$$ \left[\begin{array}{c}g_{\mathrm{tri},s,2}\\ g_{\mathrm{tri},s,3}\end{array}\right]=\mathbf{C}^{-1}\left(\widehat{\mathbf{x}}_{2\mathrm{D},s}-\mathbf{p}_{2\mathrm{D},1}\right) $$(7)

with

$$ \mathbf{C}=[\mathbf{p}_{2\mathrm{D},2}-\mathbf{p}_{2\mathrm{D},1}\ \ \mathbf{p}_{2\mathrm{D},3}-\mathbf{p}_{2\mathrm{D},1}]. $$(8)

The remaining value is computed by

$$ g_{\mathrm{tri},s,1}=1-(g_{\mathrm{tri},s,2}+g_{\mathrm{tri},s,3}). $$(9)

The factor gdir,s,i quantifies the alignment of the object-perspective direction θop,i with the object-listener direction θol,s. Its task is to favor perspectives that are directionally aligned with the direction of radiation from the source to the listener, and it hereby selects the most suitable surrounding perspectives in terms of source directivity. It is formalized as a cardioid

$$ g_{\mathrm{dir},s,i}=\left(\frac{1+\boldsymbol{\theta}_{\mathrm{op},i}^{\mathrm{T}}\boldsymbol{\theta}_{\mathrm{ol},s}}{2}\right)^{\alpha } $$(10)

of the order α. Again, the vectors θop,i and θol,s are unit vectors and illustrated by Figure 2.
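The gain computation (6)–(10) can be sketched as follows; the triangle positions, object position, and direction vectors in the usage example are made up for illustration, and the directivity order α = 1 is an arbitrary choice.

```python
import numpy as np

def barycentric_gains(x2d, p2d):
    """Eqs. (7)-(9): areal (barycentric) coordinates of the projected object position x2d
    within the triangle of projected perspective positions p2d (3 x 2 array)."""
    C = np.column_stack((p2d[1] - p2d[0], p2d[2] - p2d[0]))   # eq. (8)
    g23 = np.linalg.solve(C, x2d - p2d[0])                    # eq. (7)
    return np.array([1.0 - g23.sum(), g23[0], g23[1]])        # eq. (9)

def directivity_gain(theta_op, theta_ol, alpha=1.0):
    """Eq. (10): cardioid-shaped alignment of object-perspective and object-listener directions."""
    return ((1.0 + theta_op @ theta_ol) / 2.0) ** alpha

def combined_gains(g_tri, g_dir):
    """Eq. (6): normalized combination; the squared gains sum to one."""
    prod = g_tri * g_dir
    return np.sqrt(prod / prod.sum())

# Illustrative values:
p2d = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])          # projected perspective positions
x2d = np.array([1.0, 1.0])                                    # projected object position
theta_op = np.array([[0, 0, 1.0], [0, 1.0, 0], [1.0, 0, 0]])  # object-perspective unit vectors
theta_ol = np.array([0.0, 0.0, 1.0])                          # object-listener unit vector
g_tri = barycentric_gains(x2d, p2d)
g_dir = np.array([directivity_gain(t, theta_ol) for t in theta_op])
print(combined_gains(g_tri, g_dir))
```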

The signal encoding as introduced in (1) requires a direction vector, a signal, and a gain. The object-listener direction θol,s is employed to compute the encoder (Sect. 2.1) for the approximated object direct signal, which is the combination of the areal-coordinate-gain-weighted (6) and delay-compensated extracted signals (4)

$$ \bar{s}_s(t)=\sum_{i=1}^3 g_{s,i}\, s_{\mathrm{obj},i}\left(t-\Delta t_{s,i}\right). $$(11)

Here, delay compensation is based on the speed of sound c and the distance differences $\Delta t_{s,i}=c^{-1}\,(d_{\mathrm{ol},s}-d_{\mathrm{op},s,i})$, where the distances dol,s and dop,s,i denote the object-listener and object-perspective distances for the object s and perspective i = 1 … 3 in the triplet, respectively, see Figure 2. To model a realistic distance-dependent amplitude attenuation of the signals within the Ambisonic listener perspective, the areal coordinate gains are first multiplied by the object-perspective to object-listener distance ratio. Then the combination

$$ g_s=\mathrm{min}\left\{4,\ \sum_{i=1}^3 g_{s,i}\, \frac{d_{\mathrm{op},s,i}}{d_{\mathrm{ol},s}}\right\} $$(12)

is the gain that is employed in (1) and depends on the distance between listener and source. It is limited to a maximum of 4 (+12 dB) to avoid excessive boosts whenever dol,s becomes small.
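The combination (11) with delay compensation and the distance-dependent gain (12) could look as follows; the integer-sample delay and the use of np.roll are simplifications for the sketch (a real implementation would use fractional delays and avoid wrap-around), and all gains and distances in the example are made up.

```python
import numpy as np

def direct_signal(extracted, g, d_op, d_ol, fs, c=343.0):
    """Eqs. (11)-(12): delay-compensated, gain-weighted combination of the three
    extracted object signals and the distance-dependent encoding gain.

    extracted: (3, T) beamformer outputs s_obj,i(t)
    g:         (3,) combined gains g_{s,i} from eq. (6)
    d_op:      (3,) object-perspective distances
    d_ol:      scalar object-listener distance
    """
    out = np.zeros(extracted.shape[1])
    for i in range(3):
        delay = (d_ol - d_op[i]) / c                 # Delta t_{s,i}
        shift = int(round(delay * fs))               # integer-sample shift (simplification)
        out += g[i] * np.roll(extracted[i], shift)   # approximates s_obj,i(t - Delta t_{s,i})
    g_s = min(4.0, float(np.sum(g * d_op / d_ol)))   # eq. (12), limited to +12 dB
    return out, g_s

# Usage with made-up gains and distances:
sig = np.random.default_rng(2).standard_normal((3, 48000))
s_bar, g_s = direct_signal(sig, np.array([0.7, 0.5, 0.5]),
                           np.array([2.0, 3.0, 4.0]), d_ol=1.5, fs=48000)
print(s_bar.shape, g_s)
```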

2.3 Object direct signal suppression (residual signals)

In the optimal case, the approximated object direct signals (11) exclude all room information such as early reflections and late reverberation. They provide a clean signal for accurate directional perception; however, they do not convey a realistic room impression to the listener. To this end, the residual signals are introduced. A similar concept of direct signal suppression, albeit in the higher-order Ambisonics domain, was employed in [9] to extract ambient components.

Here, the concept of a residual signal is implemented in terms of the virtual loudspeaker object (VLO) approach [30, 31] that is illustrated in Figure 3a. Each perspective pp holds a number of microphones with the directions θm,i. Each direction and microphone signal is represented by a VLO that is positioned at a finite distance R from the perspective, pp + Rθm,i. The normalized direction vector from the listener position to the virtual loudspeaker objects as well as the corresponding distances are computed as

$$ \boldsymbol{\phi}_{p,i}=\frac{\left(\mathbf{p}_p+R\,\boldsymbol{\theta}_{\mathrm{m},i}\right)-\mathbf{p}_{\mathrm{list}}}{r_{p,i}} $$(13)

and

$$ r_{p,i}=||\left(\mathbf{p}_p+R\,\boldsymbol{\theta}_{\mathrm{m},i}\right)-\mathbf{p}_{\mathrm{list}}||, $$(14)

respectively. As described in [30, 31], the VLO gains should depend on the distance

$$ g_{\mathrm{dis},p,i}\left(r,R\right)=\left\{\begin{array}{ll}\frac{R}{r_{p,i}},& \mathrm{for}\ r_{p,i}>R\\ \frac{r_{p,i}}{R},& \mathrm{for}\ r_{p,i}\le R\end{array}\right. $$(15)

and parametrically on the direction (from unity to cardioid)

$$ g_{\mathrm{dir},p,i}(r)=\left(1-\frac{\alpha }{2}\right)+\frac{\alpha }{2}\,\boldsymbol{\theta}_{\mathrm{m},i}^{\mathrm{T}}\boldsymbol{\phi}_{p,i}, $$(16)

$$ \alpha =\frac{r_{p,i}}{r_{p,i}+R_{\mathrm{dir}}}. $$(17)

Figure 3. The VLO method [30, 31] is visualized in (a), while (b) shows the de-emphasis term introduced in equation (18).

Here, the gain (15) attenuates distant VLOs by $\frac{1}{r}$. VLOs that are too close are instead attenuated proportionally to r to maintain a robust result and to avoid erroneous localization. The gain (16) ensures that listeners far behind a VLO do not hear it, as the VLO is always oriented on-axis towards pp, while avoiding an abrupt change when walking through the VLO position. And the direction ϕp,i in (13) represents the parallactic displacement at a shifted listening position. The VLO approach achieves an enveloping and spatially plausible reproduction when used with multi-perspective microphone arrays distributed in the recorded scene. There is potential for improvement in the spatial definition of its direct-sound imaging.
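For one VLO, the geometry (13)–(14) and the gains (15)–(17) can be computed as in the sketch below; the positions and the reference distances R and R_dir are illustrative defaults, not values prescribed by the paper.

```python
import numpy as np

def vlo_gains(p_persp, theta_mic, p_listener, R=1.0, R_dir=1.0):
    """Eqs. (13)-(17): direction, distance gain, and directivity gain of one VLO.

    p_persp:    position p_p of the recording perspective
    theta_mic:  unit direction of the transducer represented by the VLO
    p_listener: listener position
    """
    diff = (p_persp + R * theta_mic) - p_listener
    r = np.linalg.norm(diff)                                         # eq. (14)
    phi = diff / r                                                   # eq. (13), listener -> VLO
    g_dis = R / r if r > R else r / R                                # eq. (15)
    alpha = r / (r + R_dir)                                          # eq. (17)
    g_dir = (1.0 - alpha / 2.0) + (alpha / 2.0) * (theta_mic @ phi)  # eq. (16)
    return phi, g_dis, g_dir

# Example for one transducer of one perspective (positions are made up):
phi, g_dis, g_dir = vlo_gains(np.array([2.0, 2.0, 1.5]),
                              np.array([0.0, 0.0, 1.0]),
                              np.array([1.0, 3.0, 1.7]))
print(phi, g_dis, g_dir)
```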

For this work, the VLO method is now modified to serve as a residual-signal renderer complementing the objects’ direct-sound signals. Here, the method encodes the 4P residual signals to the listener perspective using (13) as directions in (1). It is however necessary to exclude the direct signals from the VLO signals so that the spatial accuracy gained by object signal encoding (Sect. 2.2) is not diminished. To this end, de-emphasis is applied to each microphone-array perspective with the goal of suppressing direct signals that belong to identified sound objects. This is done by gains depending on the object-perspective directions and the object’s direct-signal amplitude, applied to the signals sp,i(t) for all perspectives p = 1 … P and transducers i = 1 … 4. Extending the gains (15) and (16) introduced in [30, 31] by a de-emphasis term Gp(t, θm,i) leads to the new residual VLO gains

$$ g_{\mathrm{r},p,i}=G_p\left(t,\boldsymbol{\theta}_{\mathrm{m},i}\right)\,g_{\mathrm{dis},p,i}\, g_{\mathrm{dir},p,i}. $$(18)

Figure 3b illustrates the concept of this de-emphasis term suppressing the directions towards the objects, which is the product of S directional gain patterns. Each of them is a mixture of unity and cardioid directivity, oriented such that the cardioid suppresses the object signals. It is defined for the array transducer directions θm,i as

$$ G_p\left(t,\boldsymbol{\theta}_{\mathrm{m},i}\right)=\prod_{s=1}^S \tilde{G}_{p,s}\left(t,\boldsymbol{\theta}_{\mathrm{m},i}\right) $$(19)

with the single-object de-emphasis pattern

$$ \tilde{G}_{p,s}\left(t,\boldsymbol{\theta}_{\mathrm{m},i}\right)=\left[1-a_{p,s}(t)\right]+a_{p,s}(t)\left(\frac{1+\boldsymbol{\theta}_{\mathrm{op},p}^{\mathrm{T}}(t)\,\boldsymbol{\theta}_{\mathrm{m},i}}{2}\right)^{\beta }, $$(20)

for which the exponent β controls the width of the directional notch, and ap,s(t) ∈ [0,1] permits controlling the depth of the notch, or releasing it for distant or quiet sound-object signals. For this purpose, ap,s(t) is defined depending on the object-perspective distance and the moving RMS value of $\bar{s}_s(t)$ (11) as

$$ a_{p,s}(t)=\frac{s_{\mathrm{RMS},s}(t)}{\mathrm{RMS}_s}\ g_{\mathrm{dis}}\left(d_{\mathrm{op},p},R_{\mathrm{de}}\right), $$(21)

with

$$ s_{\mathrm{RMS},s}(t)=\sqrt{\frac{1}{\Delta t_{\mathrm{RMS}}}\int_{t-\Delta t_{\mathrm{RMS}}}^t \bar{s}_s(\tau)^2\,\mathrm{d}\tau }, $$(22)

and gdis(dop,p, Rde) from (15) using the reference distance Rde. The value of the threshold RMSs for each sound object signal is determined by pre-computation given the recordings, or it can be defined manually.
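A sketch of the de-emphasis term (19)–(22) for one transducer direction is given below. The moving-RMS window length, the notch exponent β, the RMS ratio, and all vectors in the example are assumptions for illustration; the clipping of a to [0, 1] follows the range stated in the text.

```python
import numpy as np

def moving_rms(s_bar, fs, dt_rms=0.1):
    """Eq. (22): moving RMS of an object direct signal (window length is an assumed value)."""
    n = max(1, int(dt_rms * fs))
    return np.sqrt(np.convolve(s_bar ** 2, np.ones(n) / n, mode="same"))

def g_dis(r, R):
    """Distance gain of eq. (15), reused in eq. (21) with the de-emphasis reference distance."""
    return R / r if r > R else r / R

def de_emphasis_gain(theta_mic, theta_op_list, a_list, beta=2.0):
    """Eqs. (19)-(20): product of per-object de-emphasis patterns for one transducer direction.

    theta_op_list: per-object direction vectors theta_op,p as used in eq. (20)
    a_list:        per-object notch depths a_{p,s}(t) in [0, 1], cf. eq. (21)
    """
    G = 1.0
    for theta_op, a in zip(theta_op_list, a_list):
        cardioid = ((1.0 + theta_op @ theta_mic) / 2.0) ** beta
        G *= (1.0 - a) + a * cardioid
    return G

# Example with one object; the RMS ratio 0.8, d_op = 2 m, and R_de = 1 m are made up:
a = min(1.0, 0.8 * g_dis(2.0, 1.0))                      # eq. (21), clipped to [0, 1]
G = de_emphasis_gain(np.array([0.0, 1.0, 0.0]),
                     [np.array([0.0, 0.96, 0.28])], [a])
print(G)
```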

3 Estimation of object positions

Figure 4 provides an overview of the procedure to estimate the sound object positions necessary for the rendering algorithm as introduced in the previous sections. Given the frequency-domain microphone array surround signals, the directions of arrival (DOAs) of single-frequency components are estimated and combined into DOA maps. This is similar in concept and application to the DOA histograms in [15, 18] and is explained in Section 3.1. Section 3.2 introduces the method to intersect the directional information and compute combined values, described as the acoustic activity map. Together with the subsequent sequential peak picking algorithm (Sect. 3.3), these concepts are also discussed in [13–15]. After selection of the instantaneous set of peaks by a sequential algorithm, these are evaluated in terms of probabilistic measures for sound object emergence, continuous activity, and false detections. Section 3.7 explains the computation of these measures. They inform the decision making regarding the instantiation of particle filters for peak tracking. We use these particle filters, a well-known Monte-Carlo probabilistic estimation technique described for this application in Section 3.5, to track the three-dimensional positions of the acoustic activity peaks as time-coherent trajectories, expanding the method introduced in [22, 23].

Figure 4. The procedure of object detection and object position estimation. *I/R signifies the decisions to initialize, continue, or remove particle filter instances as introduced in Section 3.4.

Algorithm 1 The computational steps done at each time instant m.

for m = 1 … M do

 (1) Single-perspective DOA maps (Sect. 3.1)

 (2) Multi-perspective acoustic activity map (Sect. 3.2)

 (3) Peak detection (Sect. 3.3)

 (4) Particle filter prediction (Sect. 3.5)

 (5) Object detection (Sect. 3.7)

 (6) Particle filter initialization/deletion (Sect. 3.4)

 (7) Particle filter re-sampling (Sect. 3.5)

end

3.1 Single-perspective DOA map

Direction-of-arrival estimation is applied frame-wise to the surround microphone array signals. The applied approach is the magnitude sensor response method introduced in [39], which was extended to the smoothed magnitude sensor response in [40]. It operates in the frequency domain by frequency-wise computation of the covariance matrix Σ(k) of the frequency-domain signal data S(k) for the bins k = 1 … K at the time instant m, as an average over M time frames

$$ \boldsymbol{\Sigma}^{(k)}(m)=\frac{1}{M}\sum_{m'=m-\frac{M}{2}}^{m+\frac{M}{2}-1} \mathbf{S}^{(k)}[m']\cdot \mathbf{S}^{(k)}[m']^{\mathrm{H}} $$(23)

of the microphone array frequency bin magnitudes S(k). A subsequent eigenvalue decomposition

$$ \boldsymbol{\Sigma}^{(k)}=\mathbf{U}^{(k)}\boldsymbol{\Lambda}^{(k)}\mathbf{U}^{(k)\mathrm{H}} $$(24)

allows us to further decompose it into signal and noise subspaces. This is done by selecting L eigenvalues $\lambda_l^{(k)}$ per bin k. As used in [40], this can be a fixed number L, or it can be variable and computed with methods such as SORTE introduced in [41]. The case of a first-order surround signal allows the estimation of a maximum of two directions per frequency bin, as is the case with HARPEX [6]. The eigenvectors $\mathbf{u}_{*l}^{(k)}$, the columns of the left eigenvector matrix U(k) corresponding to the selected L eigenvalues, span the signal subspace. These vectors give the estimation of the DOAs

$$ \widehat{\boldsymbol{\theta}}_l^{(k)}=\frac{\mathbf{V}\,|\mathbf{u}_{*l}^{(k)}|}{||\mathbf{V}\,|\mathbf{u}_{*l}^{(k)}|\,||}\qquad\mathrm{for}\ l=1\dots L $$(25)

for each frequency bin k.

In a similar application, [15] uses histograms accumulating DOA estimates. This concept is evolved into a non-discrete map in the spherical harmonics domain. The eigenvalues $\lambda_l^{(k)}$ and the corresponding DOAs $\widehat{\boldsymbol{\theta}}_l^{(k)}$ of all frequency bins are aggregated into a broadband single-perspective DOA map $\mathcal{D}$ that is represented by real-valued spherical harmonics (cf. [33]) of the order N

$$ \mathcal{D}=\sum_{k=1}^K \sum_{l=1}^L \mathbf{y}_N\!\left(\boldsymbol{\theta}_l^{(k)}\right)\mathcal{L}\!\left(k,\lambda_l^{(k)}\right). $$(26)

The frequency- and eigenvalue-dependent function

$$ \mathcal{L}\left(k,\lambda \right)=\left\{\begin{array}{ll}k\,\sqrt{\lambda }& \mathrm{if}\ \frac{k}{K}f_s>200\ \mathrm{Hz}\\ 0& \mathrm{else}\end{array}\right. $$(27)

is used to limit the frequency range and compress large ranges of eigenvalues.

DOA maps according to (26) are continuous and have an implicit smoothing that depends on the order, which permits interpolated evaluation at arbitrary directions. This is necessary for the intersection of the single-perspective DOA maps, sampled on a three-dimensional grid.
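The chain (23)–(27) for one perspective and one time instant could be sketched as below. For brevity the map is kept at first order (ACN/N3D), whereas the paper uses higher orders; the matrix V that maps eigenvector magnitudes to directions is assumed to be given (its construction follows the magnitude sensor response method of [39, 40] and is not shown), and the toy input data are random.

```python
import numpy as np

SQRT3 = np.sqrt(3.0)

def sh1_n3d(theta):
    x, y, z = theta
    return np.array([1.0, SQRT3 * y, SQRT3 * z, SQRT3 * x])

def doa_map_single_perspective(S_mag, V, fs, L=2, f_min=200.0):
    """Sketch of eqs. (23)-(27) for one perspective and one time instant.

    S_mag: (num_mics, K, M) STFT magnitudes over the M averaged frames
    V:     (3, num_mics) matrix mapping eigenvector magnitudes to directions (assumed given)
    Returns first-order spherical-harmonic coefficients of the DOA map D.
    """
    num_mics, K, M = S_mag.shape
    D = np.zeros(4)
    for k in range(K):
        if k / K * fs <= f_min:                       # eq. (27): ignore low frequencies
            continue
        X = S_mag[:, k, :]
        Sigma = (X @ X.T) / M                         # eq. (23)
        lam, U = np.linalg.eigh(Sigma)                # eq. (24); eigenvalues ascending
        for l in range(1, L + 1):                     # the L largest eigenvalues
            v = V @ np.abs(U[:, -l])
            theta = v / np.linalg.norm(v)             # eq. (25)
            D += sh1_n3d(theta) * k * np.sqrt(max(float(lam[-l]), 0.0))  # eqs. (26)-(27)
    return D

# Toy usage: 4 mics, 256 bins, 8 frames of random magnitudes, a made-up V matrix
rng = np.random.default_rng(3)
S_mag = np.abs(rng.standard_normal((4, 256, 8)))
V = rng.standard_normal((3, 4))
print(doa_map_single_perspective(S_mag, V, fs=48000).shape)  # (4,)
```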

3.2 Multi-perspective acoustic activity map

The computation of three-dimensional data from the single-perspective DOA maps is based on the work in [13–15] and further involves spherical harmonic representation/encoding as in Section 3.1 as well as decoding to interpolate for the subsequent computations. The continuous maps (26) obtained allow the computation of values for any direction, and by extension: position. Although this multi-perspective fusion of DOA maps into sound object positions implies somewhat omnidirectional sources, at least statistically, the rendering (Sect. 2) takes source directivity into account, as far as it is observable from the sparse recording perspectives. It does so in the position-dependent signal extraction (Sect. 2.2 and (10)) by favoring the perspectives best aligned with the radiation directions to the listener.

We introduce a set of positions si with i = 1 … G, which can be regarded as arbitrary for now and will be used in two different ways below. Then we define the unit directions

$$ \boldsymbol{\theta}_{p,i}=\frac{\mathbf{s}_i-\mathbf{p}_p}{||\mathbf{s}_i-\mathbf{p}_p||} $$(28)

that sample the directions from each recording perspective position pp to every position si in the set. The corresponding directions discretize the DOA map of any perspective by evaluating a spherical harmonic interpolation matrix

$$ \tilde{\mathbf{Y}}_p=\left[\begin{array}{cccc}|& |& & |\\ \mathbf{y}_N\left(\boldsymbol{\theta}_{p,1}\right)& \mathbf{y}_N\left(\boldsymbol{\theta}_{p,2}\right)& \dots & \mathbf{y}_N\left(\boldsymbol{\theta}_{p,G}\right)\\ |& |& & |\end{array}\right] $$(29)

and hereby sampling the single-perspective DOA map to obtain the DOA activity for the positions si

$$ \mathbf{w}_p=\tilde{\mathbf{Y}}_p\,\mathcal{D}_p. $$(30)

These discrete DOA activities are subsequently weighted by a distance factor to give emphasis to perspectives close to the position si,

$$ f_p(i)=\mathrm{exp}\left(-\frac{||\mathbf{s}_i-\mathbf{p}_p||^2}{\delta_d}\right). $$(31)

This function is applied element wise to wp, and δd = 2 is the decay factor chosen here.

Next, the distance-weighted single-perspective values are combined by applying an l-norm

$$ \gamma_i=\left[\sum_{p=1}^P \left(f_p(i)\,\mathbf{w}_p[i]\right)^l\right]^{\frac{1}{l}}\qquad\mathrm{for}\ i=1\dots G. $$(32)

The term wp[i] denotes the i-th element of the vector. This yields one value γi for each position si that will be subsequently called the acoustic activity for the positions in question. The acoustic activities {γi} for an entire set of positions {si} are called acoustic activity map, below (Fig. 5).
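A first-order sketch of (28)–(32) is shown below; the clipping of negative map values to zero before applying the ℓ-norm is an assumption of the sketch (low-order maps can evaluate to negative values), and the grid, perspective positions, and map coefficients in the toy usage are arbitrary.

```python
import numpy as np

SQRT3 = np.sqrt(3.0)

def sh1_n3d(theta):
    x, y, z = theta
    return np.array([1.0, SQRT3 * y, SQRT3 * z, SQRT3 * x])

def acoustic_activity(grid, persp_pos, doa_maps, delta_d=2.0, l_norm=2.0):
    """Eqs. (28)-(32) at first order: fuse per-perspective DOA maps into activities gamma_i.

    grid:      (G, 3) candidate positions s_i
    persp_pos: (P, 3) recording perspective positions p_p
    doa_maps:  (P, 4) first-order DOA map coefficients D_p
    """
    gamma = np.zeros(grid.shape[0])
    for i in range(grid.shape[0]):
        acc = 0.0
        for p in range(persp_pos.shape[0]):
            diff = grid[i] - persp_pos[p]
            dist = np.linalg.norm(diff)
            if dist < 1e-9:                           # node coincides with a perspective
                continue
            theta = diff / dist                       # eq. (28)
            w = sh1_n3d(theta) @ doa_maps[p]          # eqs. (29)-(30), one direction at a time
            f = np.exp(-dist ** 2 / delta_d)          # eq. (31)
            acc += max(f * w, 0.0) ** l_norm          # eq. (32), negative values clipped
        gamma[i] = acc ** (1.0 / l_norm)
    return gamma

# Toy usage: a coarse horizontal grid and two perspectives with random map coefficients
xs = np.linspace(0.0, 4.0, 5)
grid = np.array([[x, y, 1.5] for x in xs for y in xs])
persp = np.array([[1.0, 1.0, 1.2], [3.0, 3.0, 1.7]])
maps = np.random.default_rng(4).standard_normal((2, 4))
print(acoustic_activity(grid, persp, maps).shape)  # (25,)
```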

Figure 5. The acoustic activity map evaluated at equidistant grids on horizontal planes for visualization, with two visible peaks.

3.3 Peak detection algorithm

We apply peak detection to the acoustic activity map at every time frame, yielding a set of Q peak observations O = [o1, …, oQ] with the corresponding acoustic activity peak values Γq with q = 1 … Q.

The peak detection algorithm is a greedy sequential algorithm similar to [15], however taking advantage of the continuous spherical harmonic DOA maps. We define the set of nodes si (cf. Sect. 3.2) on an equidistant three-dimensional grid with i = 1 … G positions and a grid spacing of ds, here chosen to be 0.25 m. The nodes are used to evaluate the acoustic activity following the procedure of equations (28)–(32). The maximum of these grid values

$$ \Gamma_q=\underset{i}{\mathrm{max}}\ \gamma_i\qquad\mathrm{for}\ i=1\dots G $$(33)

is selected. Being the global maximum of the acoustic activity, it likely corresponds to the loudest sound object in the scene. The position of the peak, the observation oq, is defined as the grid position corresponding to that maximum

$$ \mathbf{o}_q=\mathbf{s}_j\qquad\mathrm{where}\ j=\mathrm{arg}\,\underset{i}{\mathrm{max}}\ \gamma_i\quad\mathrm{for}\ i=1\dots G. $$(34)

To detect local maxima representing other sound objects and at the same time avoid ghost peaks, we apply peak deletion on the DOA maps after each iteration of peak detection. This removal of peaks is similar to approaches [42, 43] for directional loudness editing in Ambisonics or generalized spherical beamforming [44–46]. A particularly suitable suppression function is denoted as

$$ \mathcal{D}_p^{\left(q+1\right)}=\left(\mathbf{I}-\mathrm{diag}\left\{\mathbf{a}^{(q)}\right\}\mathbf{y}_N(\boldsymbol{\theta}_{0,p}^{(q)})\,\mathbf{y}_N(\boldsymbol{\theta}_{0,p}^{(q)})^{\mathrm{T}}\right)\mathcal{D}_p^{(q)} $$(35)

and applied to each single-perspective DOA map Dp$ {\mathcal{D}}_p$. The function zeros the spherical harmonic representation at the direction θ0, which is the perspective-to-peak direction

$$ \boldsymbol{\theta}_{0,p}^{(q)}=\frac{\mathbf{o}_q-\mathbf{p}_p}{||\mathbf{o}_q-\mathbf{p}_p||}. $$(36)

This removes an on-axis normalized, order-weighted beam pattern using

$$ \mathbf{a}^{(q)}=\frac{\mathbf{a}}{\mathbf{y}_N(\boldsymbol{\theta}_{0,p}^{(q)})^{\mathrm{T}}\,\mathrm{diag}\left\{\mathbf{a}\right\}\,\mathbf{y}_N(\boldsymbol{\theta}_{0,p}^{(q)})}, $$(37)

that is directed towards the peak to be erased. The order weights a are essentially arbitrary, but maxRE weights [33] of an order slightly smaller than that of the DOA map proved to work well. This deletion step is done before continuing to detect the next loudest peak, as it minimizes the likelihood of an erroneous detection of ghost peaks associated with sub-optimal ray intersections belonging to the previous peak, cf. Figure 6a. A more elaborate explanation of the deletion function and order weights can be found in [47]. The sequential peak picking is repeated until either a defined number Q of observations is reached or the peak value falls below a threshold. The step sequence is presented in Algorithm 2.

Figure 6. (a) Visualization of the ghost-peak problem: intersection of directional information can lead to ghost peaks in three-dimensional data. (b) DOA map before peak deletion, maximum at (−45, −40). (c) Renormalized DOA map after peak deletion, maximum at (45, 0). Panels (b) and (c) show the application of the peak deletion function, which removes directional components from the DOA maps $\mathcal{D}_p$.

Algorithm 2 Step sequence of the peak picking and deletion algorithm.

while Peak magnitude is above threshold and maximum number of observations not reached do

  (1) Compute acoustic activity map (32)

 (2) Pick global maximum (33) and its grid position (34)

 (3) for each perspective do

  (a) Compute direction vector (36)

  (b) Apply peak deletion (35)

end

end
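A compact sketch of Algorithm 2 is given below, combining the peak selection (33)–(34) with the deletion step (35)–(37). It stays at first order and reuses the acoustic_activity() sketch from Section 3.2 above; the order weights and the threshold are illustrative, not the values used in the paper.

```python
import numpy as np

SQRT3 = np.sqrt(3.0)

def sh1_n3d(theta):
    x, y, z = theta
    return np.array([1.0, SQRT3 * y, SQRT3 * z, SQRT3 * x])

def delete_peak(D_p, theta0, order_weights=(1.0, 0.577, 0.577, 0.577)):
    """Eqs. (35)-(37) at first order: remove an on-axis-normalized, order-weighted beam
    pointing towards theta0 from one single-perspective DOA map."""
    a = np.asarray(order_weights, dtype=float)
    y = sh1_n3d(theta0)
    a_norm = a / (y @ (a * y))                           # eq. (37)
    return D_p - np.diag(a_norm) @ np.outer(y, y) @ D_p  # eq. (35)

def sequential_peaks(grid, persp_pos, doa_maps, activity_fn, q_max=3, threshold=0.1):
    """Algorithm 2: greedy peak picking with per-perspective peak deletion.

    activity_fn(grid, persp_pos, doa_maps) -> gamma values, e.g. the
    acoustic_activity() sketch shown for Section 3.2."""
    doa_maps = doa_maps.copy()
    observations, peak_values = [], []
    for _ in range(q_max):
        gamma = activity_fn(grid, persp_pos, doa_maps)
        j = int(np.argmax(gamma))                        # eqs. (33)-(34)
        if gamma[j] < threshold:
            break
        observations.append(grid[j])
        peak_values.append(float(gamma[j]))
        for p in range(persp_pos.shape[0]):              # delete the peak in every map
            diff = grid[j] - persp_pos[p]
            theta0 = diff / np.linalg.norm(diff)         # eq. (36)
            doa_maps[p] = delete_peak(doa_maps[p], theta0)
    return observations, peak_values
```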

3.4 Peak tracking with particle filters

The peak observations O = [o1, …, oQ] introduced in Section 3.3 are instantaneous for a time instant m. They are most likely noisy and therefore not directly usable for position estimation of the salient sound objects. To compute time-coherent trajectories for this position estimation, particle filters are introduced. Extending the concepts introduced in [21–23], particle filters are used in conjunction with a probabilistic detection algorithm, which evaluates the aforementioned peak observations using transitional probabilities. This evaluation involves a procedure to (a) start tracking a detected peak, (b) continue tracking a peak, and (c) stop tracking a peak. Each peak considered relevant is associated with its own particle filter instance. Details on such an instance are found in Section 3.5. For all observations and known objects, the procedure below is considered, starting at the most prominent peak:

  1. The lifecycle of a tracking instance starts when the value $P_q^{(m)}(\mathcal{H}_{\mathrm{new}})\in[0,1]$ for the peak observation oq exceeds a threshold of 0.7. If this is the case, a particle filter for this peak is initialized at the time instant m.

  2. The continuation of a currently tracked peak is determined by the value $P_s^{(m)}\in[0,1]$. If it exceeds a threshold of 0.6, then the sound object is deemed existent, active, and observable, and a location estimate is computed for the current time instant m. After first creation of an instance according to (a), the threshold must be exceeded for longer than 0.1 s for the estimation to be considered viable, to avoid spurious detections.

  3. Complementing (b): if $P_s^{(m)}$ falls below the threshold of 0.6 for ≥0.6 s, the peak is deemed vanished and the instance is discontinued.

The procedure to compute $P_q^{(m)}(\mathcal{H}_{\mathrm{new}})$ and $P_s^{(m)}$ is explained in Section 3.7, but it is useful to first describe the particle filter as applied in this work and its computational caveats.

3.5 Particle filter dynamics

A particle filter is an estimation method employing a swarm of particles to estimate statistical measures. Here, it is used to estimate the continuous, true position of a peak in the activity map from the statistical centroid of the particle-swarm positions. The particle swarm randomly samples the map around the true peak that it is directed to follow, and each particle has its own inertia and momentum. The concept of particle filters has been applied in estimation problems in audio applications such as [19, 22, 23]. A non-exhaustive list of literature on particle filter theory is [48–55].

Since the peak observations O introduced in Section 3.3 are not time-coherent and their positional accuracy is determined by the grid spacing, the particle filtering approach is used to obtain a consistent and continuous peak position estimation over time.

When a new peak is observed, as described in Section 3.4, an instance of a particle filter is initialized. This is done by sampling N = 100 positions following a multivariate Gaussian distribution

$$ \mathbf{x}_i\sim \mathcal{N}\left(\mathbf{o}_q,\boldsymbol{\Sigma}_{\mathrm{init}}\right), $$(38)

centered around the peak observation oq (34). The covariance matrix is chosen so that the sampling range covers the inaccuracies introduced by the grid spacing ds, and it is defined as

$$ \boldsymbol{\Sigma}_{\mathrm{init}}=\mathrm{diag}\left\{\left[\begin{array}{lll}d_s^2& d_s^2& d_s^2\end{array}\right]\right\}. $$(39)

A particle is now defined by a state vector

$$ \mathbf{s}_i=\left[\begin{array}{l}\mathbf{x}_i\\ \dot{\mathbf{x}}_i\end{array}\right], $$(40)

holding the three-dimensional position and velocity, the latter initially set to zero, and a weight qi determining the importance, as it is named in [55]. The particle weights qi are the normalized acoustic activity map values

$$ q_i=\frac{\gamma_i}{\sum_{j=1}^N \gamma_j}\qquad\mathrm{for}\ i=1\dots N, $$(41)

where γi are the acoustic map values for the particle positions xi following the procedure from equations (28) to (32) without peak deletion.

The estimation of peak position is done by taking a weighted mean, or centroid,

$$ \widehat{\mathbf{x}}=\sum_{i=1}^N \mathbf{x}_i\,q_i $$(42)

of the particle positions, see (40).

The velocity values are not required above, but ensure that the trajectories obtained for the peak positions evolve smoothly over time. To ensure trajectories can be adjusted to reasonable physical qualities, the excitation-damping dynamical model adapted from [19] and [22, 23] is applied to each particle si. The prediction step

$$ \mathbf{s}_i^{(m)}=\mathbf{M}\,\mathbf{s}_i^{\left(m-1\right)}+\mathbf{F}^{(m)} $$(43)

computes the particle state for the time instant m from the state of the previous time instant m − 1. For this, the state-space system from [19] with the time step Δt determined by the hop size is denoted as the system matrix

$$ \mathbf{M}=\left[\begin{array}{cccccc}1& 0& 0& \Delta t& 0& 0\\ 0& 1& 0& 0& \Delta t& 0\\ 0& 0& 1& 0& 0& \Delta t\\ 0& 0& 0& a_{\mathrm{dyn}}& 0& 0\\ 0& 0& 0& 0& a_{\mathrm{dyn}}& 0\\ 0& 0& 0& 0& 0& a_{\mathrm{dyn}}\end{array}\right], $$(44)

as well as the process noise

$$ \mathbf{F}^{(m)}\sim \mathcal{N}\left(\mathbf{0},\boldsymbol{\Sigma}_{\mathrm{pr}}\right) $$(45)

with

$$ \boldsymbol{\Sigma}_{\mathrm{pr}}=\mathrm{diag}\left\{\left[\begin{array}{cccccc}0& 0& 0& b_{\mathrm{dyn}}& b_{\mathrm{dyn}}& b_{\mathrm{dyn}}\end{array}\right]\right\}. $$(46)

Including the process noise F is necessary to account for the unpredictability in the trajectory of unknown moving objects. The two factors

$$ a_{\mathrm{dyn}}=\mathrm{e}^{-\alpha_{\mathrm{dyn}}\Delta t}, $$(47)

$$ b_{\mathrm{dyn}}=\beta_{\mathrm{dyn}}\sqrt{1-a_{\mathrm{dyn}}^2}, $$(48)

determine the behavior of the particles. The factors αdyn and βdyn are chosen dependent on the expected object movement. They determine the damping of velocity and the process noise influence on the prediction step. In this application, these values were chosen to be αdyn = 2 and βdyn = 0.04 as the active sound objects are mostly static.
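The particle initialization (38)–(40), the prediction step (43)–(48), and the weighting and centroid estimate (41)–(42) can be summarized in the following sketch; the hop size in the toy run and the use of random values as stand-in importance weights are assumptions, while αdyn = 2, βdyn = 0.04, N = 100, and ds = 0.25 m follow the values given in the text.

```python
import numpy as np

def make_system(dt, alpha_dyn=2.0, beta_dyn=0.04):
    """Eqs. (44)-(48): constant-velocity system matrix with excitation-damping dynamics."""
    a = np.exp(-alpha_dyn * dt)                          # eq. (47)
    b = beta_dyn * np.sqrt(1.0 - a ** 2)                 # eq. (48)
    M = np.eye(6)
    M[0:3, 3:6] = dt * np.eye(3)
    M[3:6, 3:6] = a * np.eye(3)
    Sigma_pr = np.diag([0.0, 0.0, 0.0, b, b, b])         # eq. (46)
    return M, Sigma_pr

def init_particles(o_q, d_s, n=100, rng=None):
    """Eqs. (38)-(40): sample particle positions around an observation, zero initial velocity."""
    rng = rng or np.random.default_rng()
    pos = rng.multivariate_normal(o_q, (d_s ** 2) * np.eye(3), size=n)
    return np.hstack([pos, np.zeros((n, 3))])            # states s_i = [x_i, x_dot_i]

def predict(states, M, Sigma_pr, rng=None):
    """Eq. (43): prediction step with additive process noise."""
    rng = rng or np.random.default_rng()
    noise = rng.multivariate_normal(np.zeros(6), Sigma_pr, size=states.shape[0])
    return states @ M.T + noise

def estimate(states, gamma):
    """Eqs. (41)-(42): normalized importance weights and the weighted position centroid."""
    q = gamma / gamma.sum()
    return q @ states[:, 0:3], q

# Toy run: one prediction step around an observation at (2, 3, 1.5) m
rng = np.random.default_rng(5)
M, Sigma_pr = make_system(dt=1024 / 48000)               # assumed hop size
particles = init_particles(np.array([2.0, 3.0, 1.5]), d_s=0.25, rng=rng)
particles = predict(particles, M, Sigma_pr, rng=rng)
gamma = rng.random(particles.shape[0]) + 1e-6            # stand-in for acoustic activities
x_hat, q = estimate(particles, gamma)
print(x_hat)
```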

The particle filter dynamics above describe how the particles move; however, as the process noise leads to random directions, the particle swarm still needs to be led towards the locations of the greatest acoustic activity. This is achieved by importance re-sampling, which is a step that relocates particles of low importance to locations of particles with high importance; a relocated particle continues its motion from its new location. In this work, this re-sampling is applied at every time instant m, subsequent to the application of the object detection algorithm, which is subsequent to particle position prediction, cf. Algorithm 1. The re-sampling scheme used in this algorithm is the systematic approach introduced in [56, 53]; other methods are multinomial [48, 53], residual [56, 52], and stratified [50] re-sampling.

3.6 Anti-causal computation

Since the probabilistic calculations and certain time thresholds of the algorithm impose a lag on the detection of actively trackable sound objects, anti-causal look-ahead processing is introduced. After the first analysis of the microphone array recordings, a part of the process is repeated in the time-reversed direction.

Each object detection event from the causal process is used as the starting point for the anti-causal processing. The probabilistic detection algorithm is applied in the same manner as in the causal process, however without the ability to initialize new track instances. Only the continuation or removal of existing instances as introduced in Section 3.4 is applied. The particle filter prediction step itself is applied using a negative time step size Δt → −Δt in (44).

The computation of the prediction step, particle weights, and position estimation is only done for time instants m at which the causal processing has not declared the object as active yet.

3.7 Object detection algorithm

Section 3.4 already introduced the concept of threshold values determining the initialization, continuation, and removal of tracking instances. These values are computed by a probabilistic algorithm introduced in [21, 22] that is adapted to our three-dimensional application.

The sequential peak detection algorithm introduced in Section 3.3 yields Q instantaneous observations oq (34) and peak activity values Γq (33). These peaks can be caused by objects directly, but are still strongly affected by noise, reflections, discretization artifacts, and other interference. Therefore, each detected peak (observation) is evaluated in terms of its viability for the peak tracking algorithm. To accomplish this, the probability of the observation (detected peak) belonging to existing sound objects needs to be quantified, as well as the likelihood of it belonging to a new object or being a false detection. The local relative peak prominence of each peak observation is the ratio between the first, and therefore maximum, peak q = 1 and the remaining peaks q > 1

$$ P_q=\frac{\Gamma_q}{\Gamma_1}, $$(49)

in contrast to the empirically defined calculation in [22] or the histogram peak values in [23] found in the literature.

For the subsequent explanation, we assume an active sound scene of s = 1 … S initialized peak tracking instances yielding the position estimates $\widehat{\mathbf{x}}_s$ (42). All variables introduced in Section 3.5 now carry the additional index for the tracking instance s. For each sound-object track s, we define the observability as in [22, 23], but denoted differently as

$$ O_s^{(m)}=A_s^{(m)}\, E_s^{(m)} $$(50)

for the time instant m. The object activity A(m) and the object existence E(m) are probability values for the object peaks that are being actively tracked. The activity is computed using the first-order Markov model

$$ A_s^{(m)}=p_{\mathrm{A}}\,\tilde{A}_s^{(m)}+p_{\overline{\mathrm{A}}}\left[1-\tilde{A}_s^{(m)}\right] $$(51)

with the transition probabilities pA = 0.95 and $p_{\overline{\mathrm{A}}}=0.05$ representing the state transitions from active to active and from inactive to active, respectively, just as in [22]. The intermediate activity value $\tilde{A}_s^{(m)}$ in (51) is the probabilistic combination

$$ \tilde{A}_s^{(m)}={\left[1+\frac{\left(1-A_s^{(m)}\right)\left(1-P_s^{\left(m-1\right)}\right)+\epsilon }{A_s^{(m)}\,P_s^{\left(m-1\right)}+\epsilon }\right]}^{-1} $$(52)

including the object tracking probability Ps. The small value ϵ > 0 added to the numerator and denominator of the fraction ensures numerical robustness.

Then, the recursive existence is defined in [22, 23] as the function

$$ E^{(m)}=P_s^{\left(m-1\right)}+\left(1-P_s^{\left(m-1\right)}\right)\frac{\mu E^{\left(m-1\right)}}{1-\mu E^{\left(m-1\right)}}, $$(53)

that automatically assumes high values for high tracking probabilities $P_s^{(m-1)}$, or maps small values of $P_s^{(m-1)}$ to smoothed falling slopes in E(m), with the smoothing constant μ set to 0.5.
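The recursions (50)–(53) can be written as a single update function, as sketched below. Note that (52) as printed involves the current activity on both sides; the sketch uses the previous activity value instead, which is an assumption made to keep the recursion explicit.

```python
def observability_update(A_prev, E_prev, P_s_prev,
                         p_active=0.95, p_inactive=0.05, mu=0.5, eps=1e-12):
    """One step of eqs. (50)-(53): update activity, existence, and observability of a track.

    A_prev, E_prev: previous activity and existence values
    P_s_prev:       previous object tracking probability P_s^(m-1), eq. (60)
    """
    A_tilde = 1.0 / (1.0 + ((1.0 - A_prev) * (1.0 - P_s_prev) + eps)
                     / (A_prev * P_s_prev + eps))                          # eq. (52)
    A = p_active * A_tilde + p_inactive * (1.0 - A_tilde)                  # eq. (51)
    E = P_s_prev + (1.0 - P_s_prev) * (mu * E_prev) / (1.0 - mu * E_prev)  # eq. (53)
    return A, E, A * E                                                     # eq. (50)

# Toy trace: a track whose tracking probability first stays high, then drops
A, E = 0.5, 0.5
for P_s in [0.9, 0.95, 0.9, 0.2, 0.1]:
    A, E, O = observability_update(A, E, P_s)
    print(round(A, 3), round(E, 3), round(O, 3))
```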

The paper [22] introduces the following hypotheses: the observation with index q is caused (1) by a tracked sound object with index s = 1 … S, denoted as $\mathcal{H}_s$, (2) by a false detection caused by interference, denoted as $\mathcal{H}_{\mathrm{fa}}$, or (3) by a new sound object, denoted as $\mathcal{H}_{\mathrm{new}}$.

The set of observations oq with q = 1 … Q has to be mapped to these hypotheses. This results in r = 1 … $(S+2)^Q$ mapping combinations (cf. Fig. 7), which have to be evaluated. Assuming conditional independence of the single observations, this is done by computing the association probabilities

$$ p(r|\mathbf{O}^{(m)})=\prod_{q=1}^Q p(\mathbf{o}_q^{(m)}|\mathcal{H}_q)\,p\left(\mathcal{H}_q\right) $$(54)

for each combination r. The term $\mathcal{H}_q$ is the hypothesis mapped to the observation q for the combination r. The term $p(\mathbf{o}_q^{(m)}|\mathcal{H})$ describes the likelihood of an observation at the position oq while mapped to a certain hypothesis. It is defined as

$$ p(\mathbf{o}_q^{(m)}|\mathcal{H}_q)=\left\{\begin{array}{ll}p_{\mathrm{fa}}\left(\mathbf{o}_q^{(m)}\right)& \mathrm{if}\ \mathcal{H}_q=\mathcal{H}_{\mathrm{fa}}\\ p_{\mathrm{new}}\left(\mathbf{o}_q^{(m)}\right)& \mathrm{if}\ \mathcal{H}_q=\mathcal{H}_{\mathrm{new}}\\ p(\mathbf{o}_q^{(m)}|\widehat{\mathbf{x}}_s^{\left(m-1\right)})& \mathrm{if}\ \mathcal{H}_q=\mathcal{H}_s\end{array}\right.. $$(55)

Figure 7. (a) Graphical visualization of possible mapping combinations; (b) with Q = 2 and S = 1, a total of nine possible combinations exist. The selection introduced with δr selects all r where the hypothesis is included, i.e., δr,fa would lead to the set r ∈ {1, 2, 3, 4, 7}.

This involves a priori knowledge. The three-dimensional spatial distribution function pfa(oq) describes the probability of an observation o being a false detection, e.g., volumes blocked by obstacles will have a high probability of false detection. Similarly, pnew(oq) describes the known probability of new sound objects appearing in the scene, e.g., on a stage. Both can be set to a uniform distribution if no specific knowledge is available. The function $p(\mathbf{o}_q^{(m)}|\widehat{\mathbf{x}}_s^{(m-1)})$ is not two-dimensional as in [22, 23], but defined as a three-dimensional multivariate Gaussian distribution

$$ p(\mathbf{o}_q^{(m)}|\widehat{\mathbf{x}}_s^{(m-1)})=\mathcal{N}\left(\mathbf{o}_q,\widehat{\mathbf{x}}_s,\boldsymbol{\Sigma}_s\right) $$(56)

centered at the latest estimate $\widehat{\mathbf{x}}_s$ and evaluated at the observation position oq. The covariance is estimated from the particle positions of the corresponding particle filter as

$$ \boldsymbol{\Sigma}_s=c\ \mathrm{Cov}\left\{\mathbf{x}_{s,i}\right\}, $$(57)

with a scaling factor chosen to be c = 4 here.

The second term

$$ p\left(\mathcal{H}_q\right)=\left\{\begin{array}{ll}p_{\mathrm{fa}}\ \left(1-P_q\right)& \mathrm{if}\ \mathcal{H}_q=\mathcal{H}_{\mathrm{fa}}\\ p_{\mathrm{new}}\ P_q& \mathrm{if}\ \mathcal{H}_q=\mathcal{H}_{\mathrm{new}}\\ P_q\ O^{(m)}& \mathrm{if}\ \mathcal{H}_q=\mathcal{H}_s\end{array}\right. $$(58)

describes probabilities incorporating the peak prominence (49) and time instant-wise computation of the observability (50). It additionally introduces the factors pfa = 0.8 and pnew = 0.2 for tuning the impact of the prominence Pq.

The subsequent summation for all combinations containing the hypothesis gives the marginal probabilities of observation-hypothesis mappings. This is computed as

$$ P_q^{(m)}\left(\mathcal{H}\right)=\sum_r \delta_{r,\mathcal{H}}\ p(r|\mathbf{O}^{(m)}),\qquad\mathrm{with}\ \mathcal{H}\in \left\{\mathcal{H}_{\mathrm{fa}},\mathcal{H}_{\mathrm{new}},\mathcal{H}_{s=1},\dots,\mathcal{H}_{s=S}\right\} $$(59)

where the Kronecker delta denotes the selection from the set of $(S+2)^Q$ mappings where the hypothesis $\mathcal{H}$ is included (see Fig. 7). The value $P_q^{(m)}(\mathcal{H}_{\mathrm{new}})$ is required for tracking control in Section 3.4. The object tracking probability of each known sound object is evaluated as

$$ P_s^{(m)}=\sum_{q=1}^Q P_q^{(m)}\left(\mathcal{H}_s\right), $$(60)

and is also employed in tracking control of Section 3.4.

4 Technical evaluation of object position estimation

The accuracy of sound object detection and position estimation was evaluated by simulating sound scene recordings with varying noise interference [47]. The simulated scene was created with the MATLAB simulation library MCRoomSim2 [57]. The room is a 6 m × 6 m × 3.5 m box model with default room acoustic settings. For this evaluation, four static sound objects, two male and two female speakers from the EBU SQAM collection [58], are situated in the simulated room at static positions (cf. Tab. 1). Microphone arrays are located at 1 m intervals ranging from 3.5 m to 6.5 m on the x and y coordinates. To evaluate the detection capabilities also covering grids with volumetric extent, vertical layers are positioned between 0.5 m and 2.5 m with 1 m spacing along the z axis. This results in 48 virtual microphone arrays. The simulation models the microphone arrays as non-coincident using the array geometry of the Oktava MK4012 [37]. The scene's point of origin is located at the center of the room's floor.

Table 1

The sound object positions and EBU SQAM collection track numbers. The first 5 s of the signals are used in the simulation.

To measure the accuracy of the algorithm, the simulated scene is analyzed at different levels of noise interference. The SNR here is defined as the relation between the loudest microphone signal and uncorrelated white noise of the same signal energy added to all microphone signals independently. The SNR values for this evaluation were SNR ∈ {9, 12, 15, 18} dB.

4.1 Error measures

Mean Distance Error: The measure is defined as the distance from the ground-truth position to the nearest tracked sound object. Only 1-to-1 mappings are allowed; therefore, if an object has already been assigned, the next-nearest one is used if it exists. The measure is computed for every sound object j = 1 … 4 and averaged over Ntrial trials and M time frames whenever actively tracked sound objects were found:

$$ \mathrm{MDE}_j=\frac{1}{N_{\mathrm{trial}}}\sum_{n=1}^{N_{\mathrm{trial}}} \frac{1}{M}\sum_{m=1}^M \underset{i}{\mathrm{min}}\ ||\mathbf{s}_{j,\mathrm{true}}^{\left(m,n\right)}-\widehat{\mathbf{s}}_i^{\left(m,n\right)}||. $$(61)

Activity Error Time: The activity error time accumulates the time during which false positives or false negatives are observed in the sound object activity. It is computed by summing the time frame lengths in which no nearest sound object exists or the sound object probability differs from 1. It is a strict measure, for which the resulting error time can be high despite accurate results. Here it is used to show the relative improvement with varying noise interference. It is defined as

$$ \mathrm{AET}_j=\sum_{m=1}^M \phi_{j,\mathrm{active}}(m)\,\Delta t, $$(62)

$$ \phi_{j,\mathrm{active}}(m)=\left\{\begin{array}{ll}A_{j,\mathrm{truth}}^{(m)}& (\mathrm{a})\\ |P_s^{(m)}-A_{j,\mathrm{truth}}^{(m)}|& (\mathrm{b})\end{array}\right.. $$(63)

Case (a) is applied if no nearest sound object could be found that was not assigned yet, while (b) applies whenever the tracked sound object s is nearest to the jth truth value still available. The true value $A_{j,\mathrm{truth}}^{(m)}$ is defined by manual labeling based on the instantaneous magnitude of the sound object.
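For a single trial and time frame, the MDE matching of (61) can be sketched as a greedy 1-to-1 nearest assignment; the positions in the toy check are made up.

```python
import numpy as np

def mean_distance_error(truth, estimates):
    """Eq. (61) for one trial and one time frame: greedy 1-to-1 nearest matching between
    ground-truth positions and tracked positions, followed by averaging the distances.

    truth:     (J, 3) ground-truth object positions
    estimates: (I, 3) estimated (tracked) object positions
    """
    available = list(range(estimates.shape[0]))
    errors = []
    for j in range(truth.shape[0]):
        if not available:
            break
        dists = [np.linalg.norm(truth[j] - estimates[i]) for i in available]
        best = int(np.argmin(dists))
        errors.append(dists[best])
        available.pop(best)                  # enforce the 1-to-1 mapping
    return float(np.mean(errors)) if errors else np.nan

# Toy check with two objects and two estimates:
truth = np.array([[1.0, 1.0, 1.7], [3.0, 2.0, 1.2]])
est = np.array([[1.05, 0.95, 1.68], [2.9, 2.1, 1.25]])
print(mean_distance_error(truth, est))  # approx. 0.11 m
```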

4.2 Results

The mean distance error (MDE) is low for SNR values of 15 dB and 18 dB, as visualized in Figure 8a. The confidence intervals suggest stable values between 2 and 10 centimeters. On the other hand, large confidence intervals at the smaller SNR values suggest unreliable detection and position estimation. The activity error time (AET), see Figure 8b, shows a similar dependency on SNR, confirming the SNR requirement of ≥15 dB for this system.

Figure 8. (a) Higher SNR values in scene recordings yield good positional accuracy in object localization, between 2 and 10 cm. (b) The AET of the sound objects decreases with higher SNR. At lower SNRs, the confidence intervals again suggest a strong variation in results, indicative of unstable measurements. Shown are means and 95% confidence intervals.

In summary, the evaluation showed promising accuracy, giving us confidence that the rendering might be able to synthesize a spatial auditory image of the detected sound objects that closely resembles that of the recorded scene.

5 Perceptual evaluation

To evaluate the effectiveness of the proposed rendering procedure, listening experiments have been conducted. A simple scene consisting of two static active sound objects is analysed: the male speaker (Track Nr. 50) and the piano (Track Nr. 60) of the EBU SQAM collection [58]. Here, the time intervals [1; 8] s (piano) and [3; 8] s (speech) are used, where the start of the speech signal is delayed by 2 s relative to the piano signal. A second scene is simulated in which only one of the objects is present, which in turn is used as the baseline for the rendering using the estimated position data from the two-object scene. The purpose of this approach is to include possible artifacts and inaccuracies stemming from inter-object interference in the analysis whilst minimizing the complexity for the listeners of the experiments.

The experiment was conducted in two parts. First, static listener perspectives were presented to the listener where the virtual listener is positioned at fixed locations facing the active sound object. The second part presented dynamic perspectives, representing a linear motion of the listener perspective along a pre-defined path. In this case the look direction was varied between four distinct orientations. These pre-defined modes of motion are used for auralization using a binaural decoder, so that the experiment can be conducted using headphones without the need for head/position tracking equipment.

The methodology of the experiment was a MUSHRA-like [59] comparative evaluation asking for the perceptual similarity to the reference, also described as the authenticity. The conditions of the comparison include the simulated reference, the proposed method, two broadband rendering methods, and an anchor.3

The reference is the binaural rendering of the listener perspective as simulated by MCRoomSim [57].

The proposed condition is a binaural rendering following the procedure introduced in Section 2 and Section 3.

The VLO approach was introduced in [30, 31] and is a broadband spatial rendering method to enable acoustic scene playback with spatially distributed surround recordings. Refer to Figure 3a for an overview. The virtual loudspeaker objects are encoded in third-order Ambisonics and decoded to binaural signals with the IEM BinauralDecoder [60].

The Vector-Based Intensity Panning (VBIP) approach is a simple superposition of three surround recordings transformed to first-order Ambisonics signals and weighted with the areal-coordinate approach. Again, the IEM BinauralDecoder4 was used for decoding to headphone signals. A previous study of the performance of VLO and VBIP can be found in [61].
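As an illustration of the areal-coordinate weighting, the following Python sketch expresses the listener position in barycentric coordinates of the triangle spanned by the three surrounding array positions and applies the resulting weights as broadband gains to the corresponding first-order signals. The signal conventions and the amplitude normalization are assumptions made for this sketch, not the exact gain law of the tested condition:

```python
import numpy as np

def areal_coordinates(p, a, b, c):
    """Barycentric (areal) coordinates of point p w.r.t. the triangle (a, b, c),
    all given as 2D positions in the horizontal plane."""
    def signed_area(u, v, w):
        return 0.5 * ((v[0] - u[0]) * (w[1] - u[1]) - (w[0] - u[0]) * (v[1] - u[1]))
    area = signed_area(a, b, c)
    return np.array([signed_area(p, b, c),
                     signed_area(a, p, c),
                     signed_area(a, b, p)]) / area

def vbip_mix(p, array_pos, foa_signals):
    """Weighted superposition of three first-order Ambisonics recordings.

    array_pos   : (3, 2) horizontal positions of the three arrays
    foa_signals : (3, 4, N) first-order signals of the three arrays
    Returns a (4, N) interpolated first-order signal at listener position p.
    """
    w = areal_coordinates(np.asarray(p, float), *np.asarray(array_pos, float))
    w = np.clip(w, 0.0, None)          # listener assumed inside the triangle
    w = w / np.sum(w)                  # assumed amplitude normalization
    return np.tensordot(w, np.asarray(foa_signals, float), axes=(0, 0))

# toy usage: listener inside a triangle of three arrays, random test signals
pos = [[0.0, 0.0], [4.0, 0.0], [2.0, 3.5]]
sigs = np.random.default_rng(0).standard_normal((3, 4, 480))
out = vbip_mix([2.0, 1.2], pos, sigs)   # shape (4, 480)
```

An energy-preserving variant would apply the square roots of the normalized weights instead; the sketch makes no claim about which normalization the evaluated condition actually uses.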

To give a low-rating reference, a mono version of the VBIP condition was added as an anchor. It was not visually identifiable and was part of the randomly sorted multi-stimulus presentation.

5.1 Experiment

Both the static and dynamic listening tasks used a room measuring 6 m × 6 m × 3.5 m simulated in MCRoomSim with the same grid layout as in the technical evaluation above: equally spaced according to Figure 9 and vertically repeated in three layers at 0.5 m, 1.5 m and 2.5 m height. The simulated arrays model the geometry of the Oktava MK4012 [37].

Figure 9

The perspectives used in the listening experiment. (a) The static perspective positions and the trajectory of the dynamic perspective. (b) Coordinates of the static perspective positions and of the start and end point of the path.

The static-perspective tasks evaluated four listener positions that are visualized in Figure 9. Each task consisted of the comparative rating of authenticity, where the reference was always visible to the listener and the order of the conditions was randomized and not visually identifiable. Each static position was rated twice by each participant, also randomly sequenced within the set of multi-stimulus tasks.

The dynamic-perspective tasks likewise consisted of the comparative rating of authenticity with a visible reference. The four look directions A, B, C, D shown in Figure 9 were evaluated twice by each participant, with randomized condition and task orders.

The signals were optimized for the most common AKG and Beyerdynamic high-end headphone models to minimize coloration. In total, 16 expert listeners aged between 24 and 39 (average age: 29) took part in the listening experiment, taking 30 min on average to complete it. Pairwise statistical significance was assessed with a Wilcoxon signed-rank test [62] with Bonferroni-Holm correction [63]. There were 32 responses for each condition, yielding 4 × 32 = 128 responses when merging over positions 1 to 4 in part 1 and look directions A–D in part 2. The data proved to be consistent enough to be merged.
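The statistical procedure can be reproduced along the following lines; the Python sketch below uses synthetic ratings purely for illustration, while the actual analysis operates on the collected responses. Each pair of conditions is compared with a two-sided Wilcoxon signed-rank test on the paired ratings, and the resulting p-values are adjusted with the Bonferroni-Holm step-down procedure:

```python
import numpy as np
from scipy.stats import wilcoxon

def holm(pvals):
    """Bonferroni-Holm step-down adjustment of a sequence of p-values."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    adjusted = np.empty(m)
    current = 0.0
    for rank, idx in enumerate(np.argsort(p)):
        current = max(current, min(1.0, (m - rank) * p[idx]))  # keep monotone
        adjusted[idx] = current
    return adjusted

# synthetic paired ratings (128 merged responses per condition), illustration only
rng = np.random.default_rng(0)
ratings = {"proposed": rng.normal(85, 8, 128),
           "VLO": rng.normal(70, 12, 128),
           "VBIP": rng.normal(55, 12, 128)}

pairs = [("proposed", "VLO"), ("proposed", "VBIP"), ("VLO", "VBIP")]
raw_p = [wilcoxon(ratings[a], ratings[b]).pvalue for a, b in pairs]
for (a, b), p_raw, p_adj in zip(pairs, raw_p, holm(raw_p)):
    print(f"{a} vs {b}: p = {p_raw:.4g}, Holm-adjusted p = {p_adj:.4g}")
```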

All plots in Figure 10 show the sample median of the collected sample populations and ≥95% confidence intervals. The choice of the median over the mean is based on its higher robustness towards outliers, especially for relatively small sample populations. The confidence intervals are computed by applying the binomial test [64, 65] to the samples of each evaluated condition.
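One standard way to obtain such an interval is the distribution-free, order-statistic construction based on the binomial distribution with p = 0.5 (the sign-test argument). The Python sketch below follows that construction under the assumption that the samples of one condition are given as a vector of ratings; it is not claimed to be the authors' exact implementation:

```python
import numpy as np
from scipy.stats import binom

def median_confidence_interval(samples, conf=0.95):
    """Order-statistic confidence interval for the median with coverage >= conf,
    derived from the binomial distribution with p = 0.5 (sign-test argument)."""
    x = np.sort(np.asarray(samples, dtype=float))
    n = len(x)
    alpha = 1.0 - conf
    # lower rank l (1-indexed) such that P(Bin(n, 0.5) <= l - 1) < alpha/2
    l = max(1, int(binom.ppf(alpha / 2, n, 0.5)))
    u = n + 1 - l                                  # symmetric upper rank
    coverage = 1.0 - 2.0 * binom.cdf(l - 1, n, 0.5)
    return np.median(x), x[l - 1], x[u - 1], coverage

# toy usage with synthetic ratings of one condition (32 responses)
rng = np.random.default_rng(1)
med, lo, hi, cov = median_confidence_interval(rng.integers(40, 100, 32))
print(f"median {med:.1f}, CI [{lo:.0f}, {hi:.0f}], actual coverage {cov:.3f}")
```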

Figure 10

The results of the second experiment comparing dynamic perspectives for (a) the single-look-direction data and conditions as well as (b) the merged-look-direction data. Pictured are the sample medians and the ≥95% confidence intervals; notable statistical significance is marked.

5.2 Results

Static-perspective tasks: The single-position ratings are visualized in Figure 10a (medians and ≥95% confidence intervals). Participants rated the proposed approach higher than all other conditions, with statistical significance (p < 0.001) at positions 1–3. At position 4, however, VLO rendering shows a significantly higher rating (p < 0.001) than VLO at positions 1–3, and no significant difference to the proposed approach (p = 0.0541). A comparison of the merged VLO data from positions 1 and 3 on the one hand and positions 2 and 4 on the other hand (not displayed) exhibits an advantage (p < 0.001) of the direct perspective positions over the interpolated ones. The ratings of the VBIP approach decrease with distance when compared across the locations in Figure 9; the difference between the farthest and the closest position is significant (p < 0.001).

Dynamic-perspective tasks: As Figure 10c shows, the ratings of all conditions are very similar across look directions A–D, suggesting that the look direction has little influence. Moreover, the proposed and VLO methods are consistently rated higher (p < 0.001) than the VBIP condition. Between the proposed method and VLO, the advantage is not as strong but present at all look directions: A (p = 0.0696), B (p = 0.0218), C (p < 0.001), and D (p = 0.0650). This experiment supports the findings of [61], where the VLO approach performed better than the VBIP approach.

The responses of the dynamic-perspective experiment merged across all look directions (Fig. 10d) imply a significant difference (p < 0.001) between the ratings of the proposed method and those of VLO/VBIP, supporting the results of the static-perspective experiment (Fig. 10b).

5.3 Discussion

The listening evaluation confirmed the intended improvements in scene resynthesis by comparison with established broadband methods. Despite the limited sample size, most of the statistical results were significant. Both the static- and the dynamic-perspective parts of the experiment indicated a significant increase in authenticity of the proposed method over the compared ones.

For the static-perspective part, the authenticity of the proposed method, i.e. its similarity to the reference, is consistently rated high, and in the majority of cases higher than the alternatives. As the one exception, the observed drop at position 4 is most likely due to the high rating of VLO combined with the limited scale of the ratings. Further, the VLO ratings seem to depend on the listener position, and the significant increase at position 4 is due to the fact that this position coincides with a microphone array position and is far away from a source. There, the surround perspective of the microphone position provides accurate reproduction, just as with VBIP, however with a better room impression owing to the rich diversity of VLOs and their directions. By contrast, spatial reproduction with VLO and VBIP suffers at interpolated, less diffuse listener perspectives.

The dynamic-perspective part of the experiment shows an increase in the ratings of VLO, as the interpolation is perceived as smoother, and VLO is rated almost as high as the proposed method. As the residual signals of the proposed method are based on the VLO approach, smoothness and general impression are understandably similar whenever the virtual listener has moved away from the sources. The advantage lies in the auditory localization of direct sounds enabled by the proposed extraction and encoding of the object direct signals. The data show this improvement in the single-direction as well as in the merged-data comparisons.

6 Conclusion

In this contribution, we proposed an analysis and resynthesis method for acoustic scenes recorded with distributed surround microphone arrays, in the investigated case tetrahedral A-format Ambisonics microphones. We could show that multi-perspective recordings provide sufficient additional information for rendering with significantly improved spatial accuracy and authenticity, even when processing is performed broadband, in the time domain only. This effectively avoids the risk of musical-noise artifacts that any potentially more effective time-frequency processing intrinsically bears.

A numerical experiment considered the sound objects of a simulated scene and demonstrated good accuracy in object position and signal activity estimation; it also revealed a 15 dB limit on the SNR, or direct-to-diffuse ratio, that the microphones local to an active sound object should satisfy. We could further verify the expected improvement over two known broadband perspective interpolation approaches in a two-part listening experiment, whose results show improvements in authenticity and spatial definition.


1 MATLAB implementation Spherical-Harmonic-Transform by Archontis Politis, available at https://github.com/polarch/Spherical-Harmonic-Transform.

3 Conditions available at https://phaidra.kug.ac.at/o:107325.

4 Plugin available at https://plugins.iem.at/.

References

1. T. Pihlajamäki, V. Pulkki: Projecting simulated or recorded spatial sound onto 3d-surfaces, in AES Conference: 45th International Conference: Applications of Time-Frequency Processing in Audio, 03 2012. Available: http://www.aes.org/e-lib/browse.cfm?elib=16198.
2. T. Pihlajamäki, V. Pulkki: Synthesis of complex sound scenes with transformation of recorded spatial sound in virtual reality. Journal of the Audio Engineering Society 63 (2015) 542–551. Available: http://www.aes.org/e-lib/browse.cfm?elib=17840.
3. V. Pulkki: Directional audio coding in spatial sound reproduction and stereo upmixing, in AES Conference: 28th International Conference: The Future of Audio Technology – Surround and Beyond, 06 2006. Available: http://www.aes.org/e-lib/browse.cfm?elib=13847.
4. V. Pulkki, A. Politis, M.-V. Laitinen, J. Vilkamo, J. Ahonen: First-order directional audio coding (DirAC). Parametric Time-Frequency Domain Spatial Audio 10 (2017) 89–140. https://doi.org/10.1002/9781119252634.ch5.
5. A. Plinge, S.J. Schlecht, O. Thiergart, T. Robotham, O. Rummukainen, E.A.P. Habets: Six-degrees-of-freedom binaural audio reproduction of first-order ambisonics with distance information, in AES International Conference on Audio for Virtual and Augmented Reality, 08 2018. Available: http://www.aes.org/e-lib/browse.cfm?elib=19684.
6. N. Barrett, S. Berge: A new method for B-format to binaural transcoding, in Audio Engineering Society Conference: 40th International Conference: Spatial Audio: Sense the Sound of Space, 10 2010. Available: http://www.aes.org/e-lib/browse.cfm?elib=15527.
7. E. Stein, M.M. Goodwin: Ambisonics depth extensions for six degrees of freedom, in AES Conference: 2019 AES International Conference on Headphone Technology, 08 2019. Available: http://www.aes.org/e-lib/browse.cfm?elib=20514.
8. A. Allen, B. Kleijn: Ambisonic soundfield navigation using directional decomposition and path distance estimation, in ICSA, Graz, Austria, 09 2017.
9. M. Kentgens, A. Behler, P. Jax: Translation of a higher order ambisonics sound scene based on parametric decomposition, in IEEE ICASSP (2020) 151–155. https://doi.org/10.1109/ICASSP40776.2020.9054414.
10. L. Birnie, T. Abhayapala, P. Samarasinghe, V. Tourbabin: Sound field translation methods for binaural reproduction, in IEEE WASPAA (2019) 140–144. https://doi.org/10.1109/WASPAA.2019.8937274.
11. E. Bates, H. O'Dwyer, K.-P. Flachsbarth, F.M. Boland: A recording technique for 6 degrees of freedom VR, in AES Convention, Vol. 144. Audio Engineering Society, 05 2018. Available: http://www.aes.org/e-lib/browse.cfm?elib=19418.
12. H. Lee: A new multichannel microphone technique for effective perspective control, in AES Convention, Vol. 140. Audio Engineering Society, 05 2011. Available: https://www.aes.org/e-lib/browse.cfm?elib=15804.
13. A. Brutti, M. Omologo, P. Svaizer: Localization of multiple speakers based on a two step acoustic map analysis. IEEE ICASSP (2008) 4349–4352. https://doi.org/10.1109/ICASSP.2008.4518618.
14. A. Brutti, M. Omologo, P. Svaizer: Multiple source localization based on acoustic map de-emphasis. EURASIP Journal on Audio, Speech, and Music Processing 2010 (2010) 147495. https://doi.org/10.1155/2010/147495.
15. P. Hack: Multiple source localization with distributed tetrahedral microphone arrays. Master's thesis, Institute of Electronic Music and Acoustics, University of Music and Performing Arts Graz, Graz, Austria, 2015. Available: http://phaidra.kug.ac.at/o:12797.
16. G. Del Galdo, O. Thiergart, T. Weller, E.A. Habets: Generating virtual microphone signals using geometrical information gathered by distributed arrays, in 2011 Joint Workshop on Hands-free Speech Communication and Microphone Arrays, IEEE, 05 2011. Available: https://doi.org/10.1109.
17. O. Thiergart, G. Del Galdo, M. Taseska, E.A.P. Habets: Geometry-based spatial sound acquisition using distributed microphone arrays. IEEE Transactions on Audio, Speech, and Language Processing 21 (2013) 2583–2594. https://doi.org/10.1109/TASL.2013.2280210.
18. X. Zheng: Soundfield navigation: Separation, compression and transmission. Ph.D. dissertation, University of Wollongong, 2013. Available: https://ro.uow.edu.au/theses/3943/.
19. D.B. Ward, E.A. Lehmann, R.C. Williamson: Particle filtering algorithms for tracking an acoustic source in a reverberant environment. IEEE Transactions on Speech and Audio Processing 11 (2003). https://doi.org/10.1109/TSA.2003.818112.
20. M.F. Fallon, S.J. Godsill: Acoustic source localization and tracking of a time-varying number of speakers. IEEE Transactions on Audio, Speech, and Language Processing 20 (2012) 1409–1415. https://doi.org/10.1109/TASL.2011.2178402.
21. J.-M. Valin, F. Michaud, J. Rouat: Robust 3D localization and tracking of sound sources using beamforming and particle filtering. IEEE ICASSP (2006). https://doi.org/10.1109/ICASSP.2006.1661100.
22. J.-M. Valin, F. Michaud, J. Rouat: Robust localization and tracking of simultaneous moving sound sources using beamforming and particle filtering. Elsevier Science 55 (2007) 216–228. Available: https://arxiv.org/pdf/1602.08139.pdf.
23. S. Kitić, A. Guérin: TRAMP: Tracking by a real-time ambisonic-based particle filter, in LOCATA Challenge Workshop, 09 2018. Available: https://arxiv.org/abs/1810.04080.
24. J.G. Tylka, E. Choueiri: Soundfield navigation using an array of higher-order ambisonics microphones, in AES International Conference on Audio for Virtual and Augmented Reality, 09 2016. Available: http://www.aes.org/e-lib/browse.cfm?elib=18502.
25. J.G. Tylka, E.Y. Choueiri: Domains of practical applicability for parametric interpolation methods for virtual sound field navigation. Journal of the Audio Engineering Society 67 (2019) 882–893. Available: http://www.aes.org/e-lib/browse.cfm?elib=20702.
26. J.G. Tylka: Virtual navigation of ambisonics-encoded sound fields containing near-field sources. PhD dissertation, Princeton University, 2019. Available: http://arks.princeton.edu/ark:/88435/dsp011544br958.
27. N. Mariette, B.F.G. Katz, K. Boussetta, O. Guillerminet: SoundDelta: A study of audio augmented reality using wifi-distributed ambisonic cell rendering, in AES Convention, Vol. 128. Audio Engineering Society, 2010. Available: http://www.aes.org/e-lib/browse.cfm?elib=15420.
28. C. Schörkhuber, R. Höldrich, F. Zotter: Triplet-based variable-perspective (6DoF) audio rendering from simultaneous surround recordings taken at multiple perspectives, in Fortschritte der Akustik (DAGA), Hannover, Germany, 04 2020. Available: https://pub.dega-akustik.de/DAGA_2020/data/articles/000295.pdf.
29. E. Patricio, A. Rumiński, A. Kuklasiński, L. Januszkiewicz, T. Żernicki: Toward six degrees of freedom audio recording and playback using multiple ambisonics sound fields, in AES Convention, Vol. 146. Audio Engineering Society, 2019. Available: http://www.aes.org/e-lib/browse.cfm?elib=20274.
30. P. Grosche, F. Zotter, C. Schörkhuber, M. Frank, R. Höldrich: Method and apparatus for acoustic scene playback. Patent WO2018077379A1 (2018). Available: https://patents.google.com/patent/WO2018077379A1.
31. F. Zotter, M. Frank, C. Schörkhuber, R. Höldrich: Signal-independent approach to variable-perspective (6DoF) audio rendering from simultaneous surround recordings taken at multiple perspectives, in Fortschritte der Akustik (DAGA), Hannover, Germany, 04 2020. Available: https://pub.dega-akustik.de/DAGA_2020/data/articles/000458.pdf.
32. D. Rivas Méndez, C. Armstrong, J. Stubbs, M. Stiles, G. Kearney: Practical recording techniques for music production with six-degrees of freedom virtual reality, in AES Convention, Vol. 145. Audio Engineering Society, 2015. Available: http://www.aes.org/e-lib/browse.cfm?elib=19729.
33. F. Zotter, M. Frank: Ambisonics, 1st edn., Vol. 19 of Springer Topics in Signal Processing. Springer International Publishing, 2019. https://doi.org/10.1007/978-3-030-17207-7.
34. A. Politis: Microphone array processing for parametric spatial audio techniques. PhD dissertation, Aalto University, 2016. Available: http://urn.fi/URN:ISBN:978-952-60-7037-7.
35. J. Ivanic, K. Ruedenberg: Rotation matrices for real spherical harmonics. Direct determination by recursion. The Journal of Physical Chemistry 100 (1996) 6342–6347. https://doi.org/10.1021/jp953350u.
36. C. Schörkhuber, M. Zaunschirm, R. Höldrich: Binaural rendering of ambisonic signals via magnitude least squares, in Fortschritte der Akustik (DAGA), Munich, Germany, 03 2018. Available: https://pub.dega-akustik.de/DAGA_2018/data/articles/000301.pdf.
37. Oktava GmbH: Oktava MK-4012 (2019). Available: http://www.oktava-shop.com/images/product_images/popup_images/4012.jpg.
38. E. Hille: Analytic Function Theory, 2nd edn., Vol. 1. Chelsea Publishing Company, New York, 1982.
39. A. Politis, S. Delikaris-Manias, V. Pulkki: Direction-of-arrival and diffuseness estimation above spatial aliasing for symmetrical directional microphone arrays. IEEE ICASSP (2015) 6–10. https://doi.org/10.1109/ICASSP.2015.7177921.
40. T. Wilding: System parameter estimation of acoustic scenes using first order microphones. Master's thesis, Institute of Electronic Music and Acoustics, University of Music and Performing Arts Graz, Graz, Austria, 2016. Available: http://phaidra.kug.ac.at/o:40685.
41. Z. He, A. Cichocki, S. Xie, K. Choi: Detecting the number of clusters in n-way probabilistic clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (2010) 2006–2021. https://doi.org/10.1109/TPAMI.2010.15.
42. M. Kronlachner: Spatial transformations for the alteration of ambisonic recordings. Master's thesis, 2014. Available: http://phaidra.kug.ac.at/o:8569.
43. M. Hafsati, N. Epain, J. Daniel: Editing ambisonics sound scenes, in ICSA, Graz, Austria, 09 2017.
44. M. Jeffet, B. Rafaely: Study of a generalized spherical array beamformer with adjustable binaural reproduction (2014) 77–81. https://doi.org/10.1109/HSCMA.2014.6843255.
45. N. Shabtai, B. Rafaely: Generalized spherical array beamforming for binaural speech reproduction. IEEE/ACM Transactions on Audio, Speech, and Language Processing 22 (2014) 238–247. https://doi.org/10.1109/TASLP.2013.2290499.
46. M. Jeffet, N. Shabtai, B. Rafaely: Theory and perceptual evaluation of the binaural reproduction and beamforming tradeoff in the generalized spherical array beamformer. IEEE/ACM Transactions on Audio, Speech, and Language Processing 24 (2016) 708–718. https://doi.org/10.1109/TASLP.2016.2522649.
47. M. Blochberger: Multi-perspective scene analysis from tetrahedral microphone recordings. Master's thesis, 2020. Available: https://phaidra.kug.ac.at/o:104549.
48. B. Efron, R. Tibshirani: An Introduction to the Bootstrap. Chapman & Hall/CRC Monographs on Statistics & Applied Probability. Taylor & Francis, 1994. Available: https://books.google.at/books?id=gLlpIUxRntoC.
49. J.S. Liu, R. Chen: Blind deconvolution via sequential imputations. Journal of the American Statistical Association 90 (1995) 567–576. https://doi.org/10.1080/01621459.1995.10476549.
50. P. Fearnhead: Sequential Monte Carlo methods in filter theory. PhD dissertation, University of Oxford, 1998.
51. G. Kitagawa: Monte Carlo filter and smoother for non-Gaussian nonlinear state space models. Journal of Computational and Graphical Statistics 5 (1996) 1–25. https://doi.org/10.1080/10618600.1996.10474692.
52. J. Liu, R. Chen: Sequential Monte Carlo methods for dynamic systems. Journal of the American Statistical Association 93 (1998) 1032–1044. https://doi.org/10.1080/01621459.1998.10473765.
53. J. Carpenter, P. Clifford, P. Fearnhead: An improved particle filter for non-linear problems. IEE Proceedings Radar, Sonar and Navigation 146 (1999) 2–7. https://doi.org/10.1049/ip-rsn:19990255.
54. A. Doucet, N. de Freitas, N. Gordon: Sequential Monte Carlo Methods in Practice, 1st edn., Information Science and Statistics. Springer-Verlag, New York, 2001. https://doi.org/10.1007/978-1-4757-3437-9.
55. S. Särkkä: Bayesian Filtering and Smoothing, Institute of Mathematical Statistics Textbooks. Cambridge University Press, 2013. https://doi.org/10.1017/CBO9781139344203.
56. D. Whitley: A genetic algorithm tutorial. Statistics and Computing 4 (1994) 65–85. https://doi.org/10.1007/BF00175354.
57. A. Wabnitz, N. Epain, C. Jin, A. van Schaik: Room acoustics simulation for multichannel microphone arrays, in ISRA, Melbourne, Australia, 08 2010. Available: https://www.acoustics.asn.au/conference_proceedings/ICA2010/cdrom-ISRA2010/Papers/P5d.pdf.
58. EBU: Sound Quality Assessment Material recordings for subjective tests, 2008. Available: https://tech.ebu.ch/publications/sqamcd.
59. ITU: ITU-R BS.1534-3: Method for the subjective assessment of intermediate quality level of audio systems, 2015. Available: https://www.itu.int/rec/R-REC-BS.1534-3-201510-I.
60. D. Rudrich: IEM Plugin Suite. IEM, 2019. Available: https://plugins.iem.at/.
61. D. Rudrich, F. Zotter, M. Frank: Evaluation of interactive localization in virtual acoustic scenes, in Fortschritte der Akustik (DAGA), Kiel, Germany, 09 2017. Available: https://pub.dega-akustik.de/DAGA_2017/data/articles/000182.pdf.
62. F. Wilcoxon: Individual comparisons by ranking methods. Biometrics Bulletin 1 (1945) 80–83. Available: http://www.jstor.org/stable/3001968.
63. S. Holm: A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6 (1979) 65–70. Available: http://www.jstor.org/stable/4615733.
64. D. Altman, D. Machin, T. Bryant, M. Gardner: Statistics with Confidence: Confidence Intervals and Statistical Guidelines, 2nd edn. BMJ Books, 2000.
65. M. Eid, M. Gollwitzer, M. Schmitt: Statistik und Forschungsmethoden, 5th edn. Julius Beltz, 2017.

Cite this article as: Blochberger M & Zotter F. 2021. Particle-filter tracking of sounds for frequency-independent 3D audio rendering from distributed B-format recordings. Acta Acustica, 5, 20.

