Spatial audio signal processing for binaural reproduction of recorded acoustic scenes – review and challenges

Spatial audio has been studied for several decades, but has seen much renewed interest recently due to advances in both software and hardware for capture and playback, and the emergence of applications such as virtual reality and augmented reality. This renewed interest has led to the investment of increasing efforts in developing signal processing algorithms for spatial audio, both for capture and for playback. In particular, due to the popularity of headphones and earphones, many spatial audio signal processing methods have dealt with binaural reproduction based on headphone listening. Among these new developments, processing spatial audio signals recorded in real environments using microphone arrays plays an important role. Following this emerging activity, this paper aims to provide a scientific review of recent developments and an outlook for future challenges. This review also proposes a generalized framework for describing spatial audio signal processing for the binaural reproduction of recorded sound. This framework helps to understand the collective progress of the research community, and to identify gaps for future research. It is composed of five main blocks, namely: the acoustic scene, recording, processing, reproduction, and perception and evaluation. First, each block is briefly presented, and then, a comprehensive review of the processing block is provided. This includes topics from simple binaural recording to Ambisonics and perceptually motivated approaches, which focus on careful array configuration and design. Beamforming and parametric-based processing afford more flexible designs and shift the focus to processing and modeling of the sound field. Then, emerging machine- and deep-learning approaches, which take a further step towards flexibility in design, are described. Finally, specific methods for signal transformations such as rotation, translation and enhancement, enabling additional flexibility in reproduction and improvement in the quality of the binaural signal, are presented. The review concludes by highlighting directions for future research.


Introduction
Binaural reproduction of acoustic scenes refers to the playback of sound at the listener's ears in a way that recreates a real-world listening experience of the scene. Ideally, the sound scene reproduced at another time and/or place should be perceptually indistinguishable from the real scene. Some important examples include capture and subsequent reproduction of musical performances or social events, as well as real-time video conferencing with immersive spatial audio.
Headphone-based playback of binaural sound, dating back to the 19th century [1], has become highly popular in recent decades with the availability of personal headphones. This also led to the rise in popularity of headphone-based binaural reproduction, and particularly, the reproduction of recorded acoustic scenes. The latter was initially based on binaural recording, using microphones placed at the ears of a manikin [2]. While providing an impressive spatial audio experience, binaural recording generally does not support listener individualization and head tracking, which are important for creating a realistic acoustic scene through headphone listening [3,4]. The flexibility required for individualization and head-tracking was later obtained with the soundfield microphone and the Ambisonics spatial audio format [5]; these greatly advanced the recording and reproduction of real sound scenes through the separation of the recorded sound as captured by the microphone and the effect of the head on the signal at the ears, represented by the head-related transfer function (HRTF). Ambisonics was then extended to high-order Ambisonics [6][7][8][9] recorded by spherical microphone arrays [10,11], providing higher spatial detail by supporting more recording channels. The seamless incorporation of HRTF into Ambisonics generated a remarkable listening experience within an elegant mathematical setting. Indeed, Ambisonics and HRTF have been the topic of extensive research in the past two decades, supporting a wide range of applications and research areas. For example, listening to sounds generated in simulated or measured acoustic spaces has been studied under auralization [12], investigating the listening experience from a human hearing perspective [13,14]. The theory and practice of spatial audio recording and reproduction [15], and particularly Ambisonics [16], have been established, supported by advancements in spherical microphone array design and processing [17,18]. 
New approaches to spatial audio processing and coding are still being proposed [19][20][21], facilitated by improved ways for headphone listening [22,23]. However, in spite of these impressive advances over the past few decades, new emerging applications raise entirely new challenges for spatial audio in general, and binaural reproduction of recorded scenes, in particular.
A set of such emerging technologies that provides a new exciting platform for binaural reproduction applications is virtual reality (VR), augmented reality (AR), and mixed reality (MR) [12,24,25]. These originated from gaming, and have now been expanded to multimedia, education, personal communication, and virtual meetings, among many other areas. The new platforms introduce a unique set of challenges imposed by the fact that, in many cases, audio is captured by microphones that are embedded in consumer devices, which are often wearable. This is particularly challenging for the reproduction of recorded acoustic scenes. The first challenge is space and hardware limitations, which have led to the deployment of a small number of microphones of arbitrary arrangement, and often with unfavorable spatial diversity. Examples of these devices are mobile phones, laptops, smart speakers and VR headsets. These devices also introduce other challenges, imposed both by the motion of wearable and mobile arrays during signal acquisition [26], which hinders a stable listening experience of the reproduced scene, and by a low-latency constraint that occurs in applications involving real-time interactions, such as virtual video conferencing. In addition, acoustic scenes recorded by these devices may contain environmental noise and interfering sound, superimposed on the desired sound such as speech and music, which may degrade a virtual meeting, for example.
In synergy with the emerging technologies and applications, new directions in spatial audio signal processing are evolving that attempt to overcome the challenges mentioned above, and more. The aim of this review paper is to provide an updated account of these emerging methods, published in the past few years, and propose directions for future research. The paper first introduces a generalized framework for binaural reproduction of recorded acoustic scenes, then focuses on processing approaches, and concludes with prospects for future research. Regarding processing approaches, this paper first presents approaches that consider the microphone array as the dominant design element, and therefore require very specific microphone array designs. In binaural recording, two microphones are placed at the ears of a dummy head, while in Ambisonics, a dedicated array must be designed to capture spherical harmonics signals. In perceptual-based arrays, the microphones and their arrangement are carefully configured by design to produce perceptually useful signals. Next, beamforming-based processing makes a step forward by lifting the constraints on array configuration, thus allowing a flexible design. Spatial filters, or beamformers, designed specifically for the array at hand, form the basis of the approach. This is then followed by parametric approaches, where a further step is made from array-focused methods to methods that exploit information in the sound field. The information is modeled and the model parameters are estimated, providing the basis for the spatial reproduction. Finally, machine- and deep-learning approaches provide an even more flexible framework that can exploit information both in the array configuration and in the sound field. Transformations such as rotation, translation and signal enhancement, tailored to the signal processing approaches, are then presented, followed by conclusions and an outlook for the future.

Overview
This section presents an overview of the entire process comprising binaural reproduction of recorded acoustic scenes. A generalized framework that encapsulates this process is first presented, from the acoustic scene being recorded to the perception and evaluation of the reproduced spatial audio. Each part of this process is reviewed in the following subsections, while processing approaches are reviewed in greater detail in the subsequent sections.

Generalized framework
The generalized framework of spatial audio signal processing for the binaural reproduction of recorded acoustic scenes is presented in Figure 1. The process presented in the figure starts from the acoustic scene, that is, the real-world environment within which the sound is generated. This could be a concert hall with music sounds, an office with speech sounds, an outdoor environment with street sounds, and other scenes. A recording device, such as a microphone array of any type that is positioned in the scene, produces recorded audio signals. The recording device can be anything from a dummy head directly recording binaural signals, to spherical arrays or arrays of other configurations. Processing is then applied to the recorded audio signals in preparation for reproduction; this stage is the main focus of this review paper and includes a wide range of spatial audio signal processing methods, from Ambisonics, through parametric audio, to deep learning. Note that Figure 1 shows another optional layer behind processing labeled transformations, which includes enhancement, rotation and translation. After processing, the spatial audio signal is ready for reproduction; this paper focuses on headphone reproduction, which is widely used in many applications. Finally, the headphone signals are perceived by listeners, or can be evaluated objectively; this is the final block of the framework, and is labeled perception and evaluation. More details on each block of the framework are presented in the following subsections.

Acoustic scenes
Spatial sound recording and binaural reproduction have found numerous applications in a large variety of acoustic scenes, ranging from relatively small indoor spaces to expansive outdoor areas. The indoor examples include offices and meeting rooms, where binaural reproduction has been employed for teleconferencing applications [27,28]; these have recently received increased attention due to the growing popularity of VR and AR platforms and the mushrooming of distance working/learning in response to the COVID-19 pandemic. The acoustic source type of particular interest in this application is human speech. Another category of indoor acoustic scenes that has received significant attention in the past few decades is concert halls. The applications include recordings of music or other artistic performances [29], and perceptual assessment and comparison of concert hall sound [30,31]. These applications are usually characterized by elevated reverberation, and the acoustic sources of interest are primarily musical instruments and human voices. Multiple outdoor applications have also been explored. For example, spatial recordings have been utilized to capture urban sounds, including traffic, subway stations, and social gatherings; these were used to facilitate perceptual soundscape studies using various reproduction methods [32,33]. Finally, spatial sound recording methods have also been proposed for use in open outdoor environments to record nature sounds like waterfalls, birds, and wind [34]; these methods were utilized in applications related to art and entertainment [35].

Recording devices
A large variety of devices have been successfully employed for spatial sound capture. The function of the capture devices is to record the essential spatial information that enables either physically accurate [33] or perceptually plausible [36] reproduction of the signals at the listener's ears. Probably the most straightforward recording device enabling binaural reproduction is the binaural microphone (see, for example, [37]), which can be placed on the head of a human subject [32] or on an acoustically designed binaural fixture [2,38]. More complex microphone array systems have been proposed to improve spatial capture resolution and facilitate sound field manipulation. These include the B-format soundfield microphone array (comprised of four capsules located on the faces of a tetrahedron [39]), high-order spherical arrays [40] (that facilitate sound field decomposition and manipulation in the spherical harmonics domain [41,42]), approaches that support flexible recording arrays [43,44], and very large microphone-array systems with interpolation processing [45]. There also exist various perceptually-motivated microphone arrays (PMMAs), designed for capturing acoustic scenes. Whilst high-order arrays attempt to reconstruct the sound field in the reproduction process in a way that is as physically accurate as possible, perceptually motivated arrays focus on plausibly representing the sound field using psychoacoustic cues such as interchannel time- and level-differences and interchannel coherence [36]. The capture devices mentioned above enable various processing methods for enhancing and manipulating the sound field prior to reproduction, as described in Section 3.

Processing
The processing block in Figure 1 transforms the recorded signals from the previous block into binaural signals ready for headphone reproduction in the following block. This aim can be achieved with a wide range of approaches and methods, from binaural recording, which directly produces a binaural signal, to methods such as Ambisonics and beamforming-based processing, which employ microphone arrays and more complex operations. This variety of methods is reviewed in more detail in the following sections, constituting the main part of this review paper. The methods include binaural recording, Ambisonics, perceptually motivated approaches, parametric processing, beamforming-based processing, machine- and deep-learning-based methods, and transformations such as signal enhancement, translation and rotation of the listener's virtual position.

Reproduction
The reproduction block in Figure 1 converts the binaural signals back to sounds using electroacoustic transducers. When dedicated transducers are used for the left and right ears, there is no cross-talk between the left binaural signal and the right ear and vice versa, which allows for more direct control of the sound at the ears. The most common device for binaural playback of sound is the headphone [4,46], which comes in different forms, including circumaural (over-the-ear), supra-aural (on-the-ear), earbud, in-ear, and bone-conducting. Some over-the-ear headphones use an open design, allowing audio leakage out of the earpieces and ambient sound leakage into the earpieces. Other headphones use a closed design to preclude leakage. When using headphones to play binaural signals, even though the real sound sources are the electroacoustic transducers at the ears, sounds can still be perceived outside the listener's head by carefully controlling the left and right ear signals. This phenomenon, known as sound externalization, contributes to the realistic perception of a virtual scene [47].
Another factor related to headphone reproduction that contributes to realistic perception is head-tracking, which stabilizes the perceived virtual scene, despite the listener's head movements [3,48,49]. Head-tracking requires dedicated hardware, such as a head-mounted inertial measurement unit that operates in real time with limited latency [50,51]. Finally, the frequency response of the headphone may affect perception, and so often this response is compensated for using headphone equalization [52-54].

Perception and evaluation
The last block of the generalized framework presented in Figure 1 is perception and evaluation. Perception is the aim of the binaural reproduction process: to recreate spatial sounds that are perceptually indistinguishable from the real sounds, i.e., the listener perceives the spatial sound authentically as if he/she were actually in the scene [55]. Therefore, evaluating whether this aim has been achieved is of fundamental importance. The evaluation can be both technical, by means of errors in the reproduced binaural signals, and perceptual, by listening tests. Technical evaluation can be performed, for example, by quantifying the errors between a reference signal and the reproduced signal, or by evaluating the accuracy of binaural cues, such as interaural time- and level-differences (ITD and ILD), and interaural cross-correlation (IACC) [56][57][58][59][60].
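As an illustration of the technical evaluation described above, the following minimal sketch estimates broadband ITD, ILD and IACC from a pair of ear signals. The function name `binaural_cues`, the 1 ms lag window and the sign convention (positive ITD when the left-ear signal lags the right-ear signal) are assumptions made for this example; the cited studies may use band-wise or otherwise refined definitions.

```python
import numpy as np

def binaural_cues(left, right, fs, max_lag_ms=1.0):
    """Estimate broadband ITD (seconds), ILD (dB) and IACC from ear signals.

    Assumed convention: a positive ITD means the left signal lags the right.
    Both signals must have the same length, longer than the lag window.
    """
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    max_lag = int(round(max_lag_ms * 1e-3 * fs))

    # Full cross-correlation; index (len(right) - 1) corresponds to zero lag.
    xcorr = np.correlate(left, right, mode="full")
    zero = len(right) - 1
    lags = np.arange(-max_lag, max_lag + 1)
    window = xcorr[zero - max_lag: zero + max_lag + 1]

    # Normalize by the signal energies to obtain the interaural
    # cross-correlation function within the lag window.
    energy_l = np.sum(left ** 2)
    energy_r = np.sum(right ** 2)
    iacf = window / np.sqrt(energy_l * energy_r)

    best = np.argmax(np.abs(iacf))
    itd = lags[best] / fs                      # lag of the correlation peak
    ild = 10.0 * np.log10(energy_l / energy_r)  # left-to-right energy ratio
    iacc = np.abs(iacf[best])                  # peak of the normalized IACF
    return itd, ild, iacc
```

Comparing the cues of a reproduced binaural signal with those of a reference signal gives one simple error measure of the kind mentioned in the text.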
Perceptual evaluation has traditionally used a global attribute called "basic audio quality" or subjective preference. However, recent studies in spatial audio increasingly tend to evaluate different systems in terms of specific attributes (e.g., [61][62][63][64][65][66]). Examples of such attributes are sound source localization, externalization, coloration, apparent source width (ASW) and listener envelopment (LEV). More general measures for evaluating the overall perceptual accuracy, such as plausibility [67,68] and authenticity [55,69], have also been suggested. Once listening tests have been performed, comprehensive analysis could lead to a perceptual model to replace further listening tests. Examples include localization and externalization models [70][71][72][73][74], and a surround sound quality model [75]. Moreover, machine-learning algorithms have also been suggested for the evaluation of spatial perception [76,77]; this will be further discussed in Section 3.6.
While the auditory attributes stated above have traditionally been studied in the context of a static listener position and head orientation with a fixed perspective, recent developments in VR and AR require 6 degrees-of-freedom (6DoF), where the listener is free to rotate his/her head and also walk around in a virtual or real space. New tools have been developed to perform listening tests and behavioral studies in interactive virtual environments [78]. In a recent study, various direct and indirect audio quality evaluation methods were compared in virtual reality scenes of varying complexity [79]. It was found that rank-order elimination proved to be the fastest method, required the least amount of repetitive motion, and yielded the highest discrimination between spatial conditions. Scene complexity was found to be a main effect within results, while behavioral and task load index results imply that more complex scenes and the interactive aspects of 6DoF VR can impede quality judgments. Recent perceptual studies [80,81] also found that such a dynamic environment could lead to dramatic changes in the perceived reverberation, loudness, ASW and LEV, making evaluation much more challenging under such dynamic conditions.

Processing approaches
This section presents a review of methods associated with the processing block in Figure 1, providing the mapping from the captured microphone signals to binaural signals ready for listening.

Binaural recording
In binaural recording, microphones are placed at the ears of a dummy head, capturing the sound at the ears of a potential listener at the recording position. While binaural recordings have a long history [1], they are still widely used today, as they generate binaural signals, ready for listening, without the need for further processing [4]. While an attractive option in spatial audio, binaural recording suffers from two main limitations, both related to the innate embedding of the HRTF in the recording. The first is that head-tracking is typically not possible, as the head position is captured in the recording. The second is that individualized HRTF cannot be supported, as the signal embeds the HRTF of the dummy head. Solutions to the former exist, such as motion-tracked binaural recordings [82,83], or binaural cue adaptation [84]; however, these are still limited in their accuracy and flexibility. These two limitations call for more flexible recording solutions, in which the sound field is recorded separately from the HRTF, which can then be integrated in post-processing. Such approaches are presented next.

Ambisonics
Ambisonics was first introduced in the 1970s as a way to record and reproduce spatial audio using 4 audio channels, denoted as first-order Ambisonics (FOA) [5][85][86][87]. Around the late 1990s, the higher-order Ambisonics (HOA) technology, using a spherical harmonics formulation, emerged [6,8,9]. FOA and HOA were originally developed for loudspeaker array reproduction. In 1999, an approach for headphone reproduction of Ambisonics signals was introduced [88], using "virtual loudspeaker reproduction". Headphone reproduction using Ambisonics has been significantly advanced in the past decade as new applications have emerged (see Sect. 1). Specifically, a formulation in the spherical harmonics domain of binaural reproduction using Ambisonics signals, which also employed a spherical harmonics representation of the HRTF [89], was presented [7,42,90,91]. The use of the spherical harmonics formulation has become popular in recent years, due to the possibilities for efficient processing in the spherical harmonics domain, the inherent separation of the sound field and the HRTF representations, and the ease of rotation of these representations, which is useful for head-tracking, for example [16][92][93][94][95]. Figure 2 presents a general diagram for Ambisonics-based binaural reproduction, showing how the spherical harmonics representations of the sound field and the HRTF are combined to form the binaural signal.
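The ease of rotation noted above can be illustrated for the first-order case. The sketch below is a minimal example, not a reference implementation: it assumes a channel order [W, X, Y, Z], unit gain on W, the x axis pointing to the front, y to the left and z upwards, with azimuth measured counterclockwise. Normalization and channel-ordering conventions differ between Ambisonics formats, and the function names are hypothetical.

```python
import numpy as np

def encode_foa(signal, azimuth, elevation):
    """Encode a mono signal as a plane wave in first-order Ambisonics.

    Assumed convention: channels [W, X, Y, Z], unit W gain, angles in radians.
    """
    s = np.asarray(signal, dtype=float)
    gains = np.array([
        1.0,
        np.cos(azimuth) * np.cos(elevation),
        np.sin(azimuth) * np.cos(elevation),
        np.sin(elevation),
    ])
    return gains[:, None] * s[None, :]

def rotate_foa_yaw(foa, angle):
    """Rotate the sound field by `angle` radians about the vertical axis.

    W and Z are unchanged; X and Y are mixed by a 2-D rotation matrix.
    """
    c, s = np.cos(angle), np.sin(angle)
    rotation = np.array([
        [1.0, 0.0, 0.0, 0.0],
        [0.0, c, -s, 0.0],
        [0.0, s, c, 0.0],
        [0.0, 0.0, 0.0, 1.0],
    ])
    return rotation @ np.asarray(foa, dtype=float)

# Head-tracking usage: if the listener turns the head by +45 degrees (yaw),
# the scene is counter-rotated by -45 degrees before binaural decoding.
scene = encode_foa(np.ones(4), np.deg2rad(30.0), 0.0)
compensated = rotate_foa_yaw(scene, np.deg2rad(-45.0))
```

Under these assumptions, a source encoded at 30 degrees azimuth appears at -15 degrees after the compensation, so that it remains fixed in the room while the head turns. Higher orders require larger rotation matrices, but the principle is the same.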
HOA signals can be derived from microphone recordings, typically using a spherical array such as the 4th order Eigenmike [96]. The process of computing the Ambisonics signals is often termed plane-wave decomposition (PWD) [97], because Ambisonics can be related to the plane-wave amplitude density function [18]. However, practical arrays have a limited number of microphones, which may limit the spherical harmonics order and the spatial resolution, and introduce spatial aliasing at high frequencies [98]. Methods that reduce aliasing may extend the frequency range of operation of the array, for example, by aliasing cancellation [99]. Moreover, the typically small array size affects the robustness of PWD at low frequencies due to the low magnitude of the radial functions that encode scattering off the array [97]. A robust PWD method was recently proposed to overcome these low frequency limitations [100]. Another approach to enhance the Ambisonics signals is by upscaling, which aims to extend the spherical harmonics order, and leads to enhanced spatial resolution and higher-quality spatial audio signals. Earlier work includes the employment of compressed sensing [101][102][103] and sparse decomposition based on dictionary learning [104], while more recent work includes the employment of sparse recovery [105] and deep learning [106][107][108]. Order-limited Ambisonics signals translate to order truncation of the HRTF [109], which may have a detrimental effect on the perception of the reproduced binaural signals [93,110]. Several methods that overcome this limitation have been suggested in recent years [94]. Correction of spectral deficiencies by diffuse-field equalization was suggested in [110,111]. Other approaches suggested modifying the HRTF phase component, e.g., time-aligned binaural decoding [95], magnitude least-square (MagLS) [112], and bilateral Ambisonics [56]. The phase was shown to contribute significantly to the increased order of the HRTF [113], and so its modification leads to improved reproduction using low-order Ambisonics.
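To give a rough sense of the order and bandwidth limits discussed in this paragraph, the sketch below uses two relations that are not stated in the text and are introduced here as assumptions: an order-N representation has (N + 1)^2 spherical harmonics channels, and a commonly quoted rule of thumb places the onset of spatial aliasing near kr = N, where k is the wavenumber and r the array radius. The 4.2 cm radius in the usage lines is an illustrative value, not one taken from the cited works.

```python
import math

def num_sh_channels(order):
    """Number of spherical harmonics channels up to and including `order`."""
    return (order + 1) ** 2

def aliasing_limit_hz(order, radius_m, speed_of_sound=343.0):
    """Frequency at which kr equals the order N (rule-of-thumb limit).

    With k = 2*pi*f/c, the condition kr = N gives f = N*c / (2*pi*r).
    """
    return order * speed_of_sound / (2.0 * math.pi * radius_m)

# Illustrative values: first order, and a 4th-order array of assumed radius.
foa_channels = num_sh_channels(1)
hoa_channels = num_sh_channels(4)
upper_limit = aliasing_limit_hz(4, 0.042)
```

First order gives the 4 channels mentioned for FOA, and 4th order gives 25. The estimated upper limit is only indicative, since practical performance also depends on the sampling scheme and on the processing applied.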
Ambisonics has been established as a common standard for spatial audio, but even with the improvements described above, it has limitations that drive the search for improved solutions. A main limitation appears when Ambisonics with a low spherical harmonics order is used, for which the binaural reproduction may be of poor quality. Other limitations are detailed next. The frequency range of the Ambisonics signals when captured with compact microphone arrays such as a spherical array may be limited by spatial aliasing and robustness constraints, as discussed earlier in this section. On the positive side, Ambisonics readily supports spatial rotation, which is useful for head-tracking and 3 degrees-of-freedom (3DoF) rendering. However, the incorporation of spatial translation is not trivial [114]. Another limitation is that the recording of Ambisonics signals often requires a spherical array, which may not be available when using microphone arrays embedded in consumer devices, for example. Finally, the recording of real scenes may also be corrupted by noise and interference and may require enhancement. Various methods that try to overcome these limitations of Ambisonics are described in the next sections.

Perceptually motivated approaches
As outlined in Section 2.3, PMMAs aim to preserve psychoacoustic cues directly in the microphone-array signals, such that perceptual attributes of the acoustic scene are plausibly rendered. This is in contrast to reconstructing the sound field in a physically accurate manner in post-processing, an approach often employed when Ambisonics signals are computed from spherical microphone arrays, for example, as reviewed in Section 3.2. In particular, most PMMAs focus on manipulating interchannel time difference (ICTD), interchannel level difference (ICLD) and interchannel correlation (ICC) for virtual image localization and spatial-impression rendering. The concept relies on the perceptual phenomena of summing localization and the precedence effect [69]. Typically, the signals of a PMMA do not require any further decoding process for reproduction; each microphone-array signal is discretely routed to each corresponding loudspeaker. For binaural reproduction using a PMMA recording, loudspeakers are replaced by virtual sources, while the source signals are convolved with the head-related impulse responses (HRIRs) associated with the virtual source positions. This approach, illustrated in Figure 3, offers an attractive advantage: binaural reproduction with good perceptual quality can be achieved even with a small number of microphones.
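The virtual-source rendering described above amounts to convolving each channel with a pair of HRIRs and summing the results per ear. A minimal sketch follows, assuming that all channels have equal length, that all HRIRs have equal length, and that the HRIRs for the chosen virtual source positions are already available; the function name is hypothetical.

```python
import numpy as np

def render_virtual_loudspeakers(channels, hrirs_left, hrirs_right):
    """Render loudspeaker-feed channels binaurally via virtual sources.

    channels:    sequence of 1-D arrays, one signal per virtual loudspeaker
    hrirs_left:  sequence of 1-D arrays, left-ear HRIR per virtual loudspeaker
    hrirs_right: sequence of 1-D arrays, right-ear HRIR per virtual loudspeaker
    Returns the left and right ear signals.
    """
    n_out = len(channels[0]) + len(hrirs_left[0]) - 1
    left = np.zeros(n_out)
    right = np.zeros(n_out)
    for x, h_l, h_r in zip(channels, hrirs_left, hrirs_right):
        # Each channel contributes to both ears through its own HRIR pair.
        left += np.convolve(x, h_l)
        right += np.convolve(x, h_r)
    return left, right
```

For long signals, FFT-based or partitioned convolution would normally replace `np.convolve`; the direct form is used here only for clarity.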
There exist several models of the ICTD and ICLD tradeoff for controlling the degrees of image shift [115][116][117], which are used for designing the spacing and relative angle between microphones in an array. These models can also be used to affect the characteristics of a virtual source for a given perceived source position. In particular, higher ratios of ICTD to ICLD lead to more spacious, but less localizable, sources, and a greater sense of depth and spread [63]. Achieving a sufficient amount of interchannel decorrelation is another important design goal for PMMAs. Decorrelation is not only important for an auditory spatial impression, i.e., ASW and LEV [58,118], but also for extending the size of the listening area in loudspeaker reproduction. This is of less importance in binaural reproduction, where the listener is always at the sweet spot [119,120]. Decorrelation is also frequency dependent [121]. Since low-frequency decorrelation has been reported to be important for LEV [122], various decorrelation methods have been proposed [118,123,124]. Furthermore, decorrelation of vertically oriented signals has been found to have a minimal, or no, effect on the vertical spread of virtual sources, depending on source frequency [124,125]. This allows a three-dimensional microphone array to be more compact vertically. Examples include the ORTF-3D [126] and ESMA-3D [59] arrays.
Despite providing good perceptual quality with a small number of microphones, PMMAs do not directly support generic representations like Ambisonics, making this approach specific to a loudspeaker configuration. With this limitation in mind, methods have been developed to transform PMMA signals into Ambisonics. A recent study [127] investigated the perceived spatial and timbral degradation when signals of various PMMAs were directly encoded to Ambisonics with different orders, and binaurally reproduced using the MagLS decoding method [95]. A multiple stimulus with hidden reference and anchor (MUSHRA) listening test revealed that the perceived degradation was minimal with the order of 2 or higher, depending on the decoder. This suggests that Ambisonics could be a useful coding and delivery format for PMMA recordings. In summary, PMMAs aim for high perceptual quality with a small number of microphones, but come at the cost of highly specific microphone-array designs. Alternative approaches with a similar aim, but supporting more flexible array designs are reviewed next.

Beamforming-based processing
Beamforming-based processing refers to the family of methods that transform microphone signals into a binaural signal in two stages. In the first stage, beamforming, or spatial filtering, is applied to the microphone-array signals, most commonly to represent sound field components associated with specific directions. Then, in the second stage, these components are filtered by the appropriate HRTF and combined to form binaural signals, as illustrated in Figure 4. This is a useful approach, and in its current form it offers great flexibility with respect to array configuration. Ambisonics signals derived from spherical microphone arrays can be considered as a special case of this approach, as detailed below.
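The two stages can be sketched for a single frequency bin as follows. This is a minimal illustration under several assumptions that are not taken from the text: microphones in free field (no scattering from a device or head), plane-wave steering vectors with an assumed phase sign convention, delay-and-sum weights, and no direction-dependent weighting when the beams are combined. All function names are hypothetical.

```python
import numpy as np

def steering_vectors(mic_positions, directions, freq, c=343.0):
    """Free-field plane-wave steering vectors, shape (Q, M).

    mic_positions: (M, 3) array in metres
    directions:    (Q, 3) array of unit vectors pointing towards the sources
    The sign of the phase term is an assumed convention.
    """
    k = 2.0 * np.pi * freq / c
    return np.exp(1j * k * (directions @ mic_positions.T))

def beamform_then_hrtf(mic_spectrum, steer, hrtf_left, hrtf_right):
    """Stage 1: beamform towards Q directions. Stage 2: apply HRTFs and sum.

    mic_spectrum: (M,) complex microphone spectra at one frequency
    steer:        (Q, M) steering vectors at that frequency
    hrtf_left, hrtf_right: (Q,) complex HRTFs for the Q directions
    Returns the left and right ear spectra and the Q beam outputs.
    """
    n_mics = steer.shape[1]
    weights = steer / n_mics                 # delay-and-sum: w_q = v_q / M
    beams = weights.conj() @ mic_spectrum    # y_q = w_q^H x
    left = hrtf_left @ beams
    right = hrtf_right @ beams
    return left, right, beams
```

With these weights a plane wave arriving exactly from a look direction passes through the corresponding beam with unit gain. In practice the beamformer design and the choice and number of look directions strongly affect the reproduced binaural signal, which is the subject of the work reviewed in this section.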
Early work developed within this framework employed Ambisonics signals, and did not explicitly use the term beamforming. Here, Ambisonics signals were decoded into signals that are used to directly drive an array of actual loudspeakers [128], or, alternatively, an array of virtual loudspeakers [42]. The set of virtual loudspeaker array signals was further filtered with HRTF to produce binaural signals. This approach was later extended from Ambisonics to spherical arrays in general, by decomposing the measured microphone signals into spherical harmonics and then plane waves, finally reproducing binaural signals by combining each plane wave with the appropriate HRTF [41]. Further work implemented this approach on a real spherical array [129], and analyzed the approach theoretically [130].
Having established the generation of virtual loudspeaker signals and then signals related to PWD, the approach was then extended mathematically to employ beamformers to estimate signals in specific arrival directions. This approach builds on a well-established theory of beamforming [131], with well-defined design methods. Early work incorporated maximum-directivity beamformers, leading to Ambisonics signals and PWD for spherical arrays [132][133][134]. Later, other beamformers, such as the delay-and-sum beamformer, were also investigated [135]. However, these studies were limited to spherical arrays.
Another direction of research work related to beamforming that was also applied to spherical microphone arrays used beamforming or spatial filtering to shape the directivity of the sound field, thus reducing noise arriving from directions attenuated by the spatial filter (see Sect. 4.2), with the entire process embedded in an Ambisonics setting [136][137][138][139]. This approach demonstrated a trade-off between noise reduction and spatial audio quality. A different approach, also related to the methods in Section 4.2, placed emphasis on noise reduction using high-performance beamformers, such as the minimum-variance distortionless response (MVDR) beamformer [140] and the linearly-constrained minimum variance (LCMV) beamformer [141]. These approaches only partly supported spatial audio reproduction quality by incorporating constraints in the beamformer design to ensure basic cues of the binaural signals, such as ILD and ITD for specific sources at the beamformer output. This approach did not involve HRTF and the quality of the reproduced spatial audio was limited.
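For reference, the standard narrowband MVDR weights can be computed as below. This is the generic textbook form, w = R^{-1} d / (d^H R^{-1} d), given here as a sketch; it does not reproduce the cue-preserving binaural designs of the cited works, and the function name is hypothetical.

```python
import numpy as np

def mvdr_weights(noise_cov, steering):
    """Narrowband MVDR beamformer weights.

    noise_cov: (M, M) Hermitian positive-definite noise covariance matrix
    steering:  (M,) steering vector of the desired source
    The weights minimize output noise power subject to w^H d = 1.
    """
    r_inv_d = np.linalg.solve(noise_cov, steering)   # R^{-1} d
    return r_inv_d / (steering.conj() @ r_inv_d)     # normalize by d^H R^{-1} d
```

The beamformer output for a microphone spectrum `x` is then `w.conj() @ x`. When the noise covariance is the identity matrix, the weights reduce to those of a delay-and-sum beamformer.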
In more recent studies, the beamforming approach developed in previous work was applied to arrays of arbitrary configuration, such as arrays mounted on helmets [142] or glasses [143], linear arrays [43,144], and wall-mounted planar arrays [145]. These recent studies extended previous work, which was mostly developed for Ambisonics signals. Design methods were further developed by proposing a framework for selecting the number of beamforming directions [145], by direct matching of microphone signals to binaural signals [44,143], and by designing virtual artificial heads [43,144]. The last may require efficient representations of HRTF, e.g., [146]. These are initial steps in the development of methods that will support high quality binaural reproduction based on practical microphone arrays, such as wearable arrays and arrays with arbitrary configuration.
In summary, while considerable progress has been made in beamforming-based binaural reproduction, most previous work was developed for Ambisonics signals; it may not be possible to accurately compute these signals from signals measured by arrays with a small number of microphones (e.g., from microphones mounted on devices). For such arrays, current beamforming-based design methodology may offer an attractive and flexible alternative; however, at this point in time, further research providing theoretical grounding is required, as well as further development of processing methods to support high quality binaural reproduction from such arrays.

Parametric processing
Parametric processing is based on relatively simple, in some cases, perceptually motivated, sound field modelling. The processing generally consists of two steps. In the first step a specific sound field model is assumed and its parameters and signals are estimated, while in the second step the binaural signals are synthesized. Reproduction based on a small number of parameters may be advantageous when the complexity of the sound field cannot be captured by the recording array. In this case, estimating a small number of perceptually important parameters may be more useful than attempting to capture the full complexity of the sound field.
One of the earliest approaches to parametric signal processing for spatial audio is based on decomposing the sound field into a direct-sound component, representing the sound source, and a diffuse sound component, representing reflections and room reverberation. The approach, referred to as DirAC (directional audio coding) [147], was developed for FOA. A similar approach decomposed the sound field into primary and ambient components [148]. The former component is highly correlated between input channels (representing sources), and the latter is uncorrelated (representing reverberation and background noise). Both approaches process the signals in the time-frequency domain, exploiting the sparsity property of audio signals such as speech. Therefore, while only one source per time-frequency bin is modeled, overall, these approaches can model an acoustic scene with multiple sources. Another alternative, high-angular-resolution plane-wave expansion (HARPEX) [149,150], models two plane waves per time-frequency bin, complemented by two opposing plane waves, thus enriching the plane-wave model.
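The direct/diffuse analysis step of a DirAC-style method can be sketched for one frequency bin as follows. This is a simplified illustration and not the implementation of [147]. It assumes first-order signals with the convention W = s and (X, Y, Z) equal to s times the unit vector pointing towards the source, so that the intensity-like vector Re{W* (X, Y, Z)} points towards the source. Normalization and sign conventions differ between Ambisonics formats, and the function names are hypothetical.

```python
import math

def encode_foa(s, azimuth, elevation=0.0):
    """Encode a plane-wave sample s into (W, X, Y, Z) under the assumed
    convention: W = s, (X, Y, Z) = s * unit vector towards the source."""
    ce = math.cos(elevation)
    return (s, s * math.cos(azimuth) * ce,
            s * math.sin(azimuth) * ce, s * math.sin(elevation))

def dirac_parameters(frames):
    """Estimate azimuth, elevation (radians) and diffuseness for one
    frequency bin from a short sequence of (W, X, Y, Z) frames."""
    ix = iy = iz = energy = 0.0
    for w, x, y, z in frames:
        # Accumulate the intensity-like vector and the energy over time;
        # the sums stand in for the expectation operators of the model.
        ix += (w.conjugate() * x).real
        iy += (w.conjugate() * y).real
        iz += (w.conjugate() * z).real
        energy += 0.5 * (abs(w) ** 2 + abs(x) ** 2 + abs(y) ** 2 + abs(z) ** 2)
    norm = math.sqrt(ix * ix + iy * iy + iz * iz)
    azimuth = math.atan2(iy, ix)
    elevation = math.atan2(iz, math.hypot(ix, iy))
    # Diffuseness: 0 for a single plane wave, 1 when the net intensity vanishes.
    diffuseness = 1.0 - norm / energy if energy > 0.0 else 1.0
    return azimuth, elevation, min(max(diffuseness, 0.0), 1.0)
```

For a single plane wave the estimated direction matches the encoding direction and the diffuseness is zero, whereas two equal-power waves from opposite directions cancel in the intensity sum and yield a diffuseness close to one. In the synthesis step, the direct part would be rendered from the estimated direction (for example with the corresponding HRTF) and the diffuse part decorrelated.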
While useful, these early approaches for parametric spatial audio processing are limited due to their simplistic models [21], and so methods employing more complex models have been developed [20,151,152]. With the aim of extracting multiple dominant plane waves from complex sound fields, sparse recovery approaches have been employed [20,153]. Multiple plane-wave modeling and a more flexible representation of the reverberant part of the sound field have also been the basis for HOA extensions of DirAC [20,154,155,156], leading to improved spatial resolution and a more accurate representation of complex sound fields. This approach, developed for Ambisonics signals and spherical arrays, has been extended to incorporate general microphone arrays, by employing optimal multiple channel filters to estimate direct signals from sources [20,152,157]. Figure 5 presents a general block diagram, capturing the main processing blocks common to parametric spatial audio signal processing for binaural reproduction. While the approaches discussed above are often presented in the context of loudspeaker reproduction, they are nevertheless relevant for headphone reproduction by employing virtual loudspeakers, or by rendering sources by incorporating HRTF [20,21].
Overall, parametric processing has been a promising avenue for binaural reproduction from microphone-array recordings, as it has the potential to capture important spatial information through the modelling process. Further research may provide high-quality reproduction even with challenging environments that include multiple dynamic sources, spatially complex sources [158,159], reverberation and noise, and by employing compact arrays with only a few microphones. Improved methods for estimating information on individual sources and on reverberant components, as well as methods that incorporate early room reflections [160][161][162][163], may advance the parametric approach even further. The parametric processing approach also supports signal transformations such as rotation and translation, due to the simplified sound-field representation, as will be further discussed below.

Machine- and deep-learning-based processing
With the advent of deep-learning methods, machine learning has seen broad application for a wide variety of research problems, including in the fields of audio and acoustics. Recently, novel machine-learning-based methods that fit within the generalized framework shown in Figure 1 have been proposed.
Understanding the characteristics of the acoustic environment may be useful in a spatial audio processing framework (see Sect. 2.2). The annual IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE) [164] includes contributions related to automatically classifying the type of acoustic scene [165], or detecting and localizing sound events from spatial audio recordings [166]. Grumiaux proposed the use of the time-domain velocity vector as an input feature for a deep neural network (DNN) to count and localize multiple speakers in Ambisonics signals [167]. A related problem is the blind estimation of room acoustic parameters from audio recordings [168], including the estimation of reverberation time and the early-to-late reverberation ratio [169][170][171][172][173].
Given a recording of an acoustic scene, data-driven and machine-learning approaches can be used for audio processing (see Sect. 2.4). A method has been proposed in [174] for upmixing monophonic recordings to FOA by combining audio processing and computer-vision methods to infer sound source locations from a panoramic video recording of the scene. The estimated Ambisonics signals can then be processed further for binaural reproduction. Directly deriving the binaural output signals from a monophonic recording has also been proposed, taking into account the position and orientation of the listener relative to the source [175]. In another study, convolutional neural networks were employed to upscale the Ambisonics order of encoded FOA recordings [106]. On the reproduction side, generative adversarial networks were proposed to reduce the error when rendering Ambisonics-encoded sound fields over four loudspeakers [108]. Finally, data-driven and machine-learning-based approaches have been proposed for the perceptual evaluation of reproduced scenes. A model that predicts front-back and elevation perception of sound sources was introduced in [70], while predicting spatial audio quality using computational models was proposed in [77]. For an extensive review of current data-based spatial audio methods the reader is referred to the work by Cobos et al. [176].
With the increased popularity of machine- and deep-learning research, it is expected that these approaches will play an increasingly significant role in the near future for spatial audio, in general, and for the binaural reproduction of recorded acoustic scenes, in particular. As these approaches are data-driven, they have the potential to overcome limitations imposed by microphone-array configurations, and to implicitly exploit information embedded in the sound field, leading to highly flexible solutions; nevertheless, these solutions may require tailoring to specific systems and applications.

Transformations
This section presents processing methods that can be considered as additions to the main processing chain of mapping microphone signals to binaural signals, as illustrated in Figure 1. These include signal enhancement to reduce unwanted interfering sounds in the spatial audio signal, and translation and rotation that support the mobility of a listener in a virtual audio environment.

Rotation and translation
During binaural reproduction with headphones, listeners may rotate their heads, leading to a corresponding rotation of the acoustic scene, which is perceived as unnatural. This can be corrected by head-tracking, i.e., rotating the acoustic scene to counter the listener's head rotations, thereby stabilizing the virtual scene and providing the feeling of immersion in a real scene. This head-tracking is denoted as having 3DoF. Furthermore, listeners may move freely, i.e., walk through the reproduced scene with a combination of rotational and translational movements. The latter refers to moving forwards and backwards, up and down, and left and right. The translation is often referred to as sound field translation, sound field navigation, or scene walk-through. When paired with rotation, the complete freedom of movement is denoted 6DoF. The objective of 6DoF reproduction is to enable a listener to walk through an acoustic scene in VR/AR, leaning close to sound sources or reflectors and hearing a realistic life-like recreation of the true experience (ideally with matched visuals).
A schematic illustrating how recordings are compensated for listener rotation and translation for the case of an Ambisonics signal is given in Figure 6. A straightforward method for enabling head rotation is to record the scene with multiple binaural microphones at different azimuth rotations, for example, by placing microphones [82,177] or binaural microphones [178] around the equator of a sphere. During reproduction, the listener's head rotation is tracked and the microphone signals closest to the ears are interpolated or directly played back. However, currently, head rotation for 3DoF is more commonly achieved by rotating the Ambisonics representation of the sound field [128,179], or equivalently, rotating the Ambisonics representation of the HRTF [180]. Ambisonics rotation is easily performed by applying a time- and frequency-independent rotation matrix to the Ambisonics coefficients [181][182][183][184][185][186]. The challenges of head-rotation-enabled binaural reproduction for recording by non-standard or wearable microphone arrays [187][188][189] are still the subject of ongoing research.
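As an illustration of Ambisonics rotation, the sketch below compensates a head rotation about the vertical axis (yaw) for first-order signals. It assumes the convention W = s and (X, Y, Z) equal to s times the unit vector towards the source, with azimuth measured counter-clockwise from the front. The function names are hypothetical, and higher orders require larger block-diagonal rotation matrices that are omitted here.

```python
import math

def rotate_foa_yaw(w, x, y, z, scene_yaw):
    """Rotate a first-order sound field about the vertical axis by
    scene_yaw radians; W and Z are unaffected by a yaw rotation."""
    c, s = math.cos(scene_yaw), math.sin(scene_yaw)
    return w, c * x - s * y, s * x + c * y, z

def compensate_head_yaw(w, x, y, z, head_yaw):
    """Counter-rotate the scene by the tracked head yaw so that sources
    remain fixed in the room when the listener turns the head."""
    return rotate_foa_yaw(w, x, y, z, -head_yaw)
```

The same four coefficients of the rotation are applied to every time sample and frequency bin, which is what makes this operation inexpensive. For example, a source encoded at 30° azimuth appears straight ahead after compensating a head yaw of 30°.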
There are three main approaches towards enabling 6DoF, each distinguished by the recording setup. The first is a source-based approach, where spot microphones are used to record each sound source individually within the scene [190]. The recorded scene is virtually pieced back together by representing the sources as virtual objects at similar positions. The virtual object signals are panned and amplified depending on the listener's real-time position and rotation. This approach is easy to adapt to different binaural rendering methods. However, no source-directivity information is captured, and the specific acoustics of the environment are not typically captured or reproduced.
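The per-source update in such an object-based renderer can be sketched as follows, restricted to the horizontal plane. This is an illustration under stated assumptions and not the method of [190]: the gain follows a free-field 1/r law with a clamped minimum distance, room acoustics and source directivity are ignored, and the function and parameter names are hypothetical.

```python
import math

def render_parameters(source_xy, listener_xy, listener_yaw, min_distance=0.1):
    """Return (relative azimuth in radians, distance gain) of one virtual
    source for the current listener position and head yaw."""
    dx = source_xy[0] - listener_xy[0]
    dy = source_xy[1] - listener_xy[1]
    # Clamp the distance so that the 1/r gain stays bounded near the source.
    distance = max(math.hypot(dx, dy), min_distance)
    azimuth = math.atan2(dy, dx) - listener_yaw
    # Wrap the relative azimuth to the interval (-pi, pi].
    azimuth = math.atan2(math.sin(azimuth), math.cos(azimuth))
    return azimuth, 1.0 / distance
```

The relative azimuth would select the HRTF (or panning gains) for the source, and the gain would scale the spot-microphone signal. Both are recomputed whenever the tracked listener pose changes.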
The second 6DoF technique, denoted extrapolation-based, records the acoustic scene from a single spatial position, usually with a single HOA microphone. The Ambisonics recording is processed into a secondary representation built out of virtual loudspeakers [191][192][193], virtual microphones [194], virtual near-field point-sources [195][196][197][198], virtual far-field plane waves [199][200][201][202], or virtual near- and far-field sources [203]. Alternatively, the HOA recording can be directly re-expanded about a translated position without the secondary representation [182,204]. The extrapolation-based translation, however, is usually limited by a sweet-spot distance that is defined by the well-known truncation properties of the Ambisonics decomposition [205]. To address this limitation, methods often use additional assumptions or parametric information about the recording to gain extended translation. For example, methods use known or estimated source directions and/or distances [192,196,198,200,206], a distance map [197,199], or spatial sparsity assumptions [203,207]. While the translation distance is limited, the extrapolation-based approach benefits from its unobtrusive and cost-efficient use of a single HOA microphone. Lastly, the potential applicability to other single recording devices, such as wearable microphone arrays, suggests that the extrapolation approach will continue to develop.
The third 6DoF technique, denoted interpolation-based, records the scene from multiple spatial positions with a distributed grid of Ambisonics (first-order or higher-order) microphones. Existing approaches for Ambisonics interpolation can broadly be classified into two categories: parametric approaches in the time-frequency domain, and broadband approaches in the time domain. The parametric approach exploits time-frequency analysis of the multiple Ambisonics recordings to infer underlying source characteristics (mainly the location information), which are then explicitly [208][209][210][211][212] or implicitly [213][214][215] used to render the reproduced sound field at interpolated listening positions. Tracking-based solutions for moving sources have also been proposed [216][217][218][219][220]. Additional information on source locations enlarges the supported range of shifted listening perspectives with high spatial definition, yet the time-frequency processing often results in musical noise artifacts. In contrast, broadband approaches such as weighted averaging and virtual loudspeaker objects (VLO) make no attempt at analyzing underlying source characteristics, and their time-domain processing avoids the risk of introducing musical noise. The weighted averaging method [221][222][223] applies distance-based weights to each recording and has a few notable shortcomings, including a limited listener movement region and poor localization accuracy. In contrast, the VLO method [193,224] maps the recordings to multiple surround playback rings of virtual loudspeaker objects, whose direction and amplitude vary with the desired listener position, thus providing enhanced spatial fidelity. More recently, in [225,226], the authors presented methods that merge and extend the concepts of parametric and broadband interpolation. 
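The weighted averaging idea can be sketched as follows for one sample (or one time-frequency bin) of the Ambisonics coefficients. The inverse-distance weighting used here is an assumption chosen for illustration; the methods in [221][222][223] may use other weighting functions. The function and parameter names are hypothetical.

```python
import math

def interpolate_ambisonics(recordings, mic_positions, listener_position, eps=1e-9):
    """Inverse-distance weighted average of Ambisonics coefficient vectors.

    recordings: one equal-length list of coefficients per microphone.
    mic_positions, listener_position: coordinates of equal dimension.
    eps avoids division by zero when the listener is at a microphone."""
    weights = [1.0 / (math.dist(p, listener_position) + eps)
               for p in mic_positions]
    total = sum(weights)
    weights = [wt / total for wt in weights]  # normalize to unit sum
    num_channels = len(recordings[0])
    return [sum(wt * rec[ch] for wt, rec in zip(weights, recordings))
            for ch in range(num_channels)]
```

At a microphone position the result is practically that microphone's recording, and midway between two microphones it is their mean. Because the recordings are mixed without any analysis of the sources, the directional content of the output is a blend of the recorded perspectives, which is consistent with the limited localization accuracy noted above.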
Overall, interpolation-based approaches offer potentially longer translation distances, but with the trade-off of the increased costs associated with using multiple HOA microphones [227,228].
While numerous rotation and translation solutions have been recently developed, capturing and accurately reproducing large acoustic scenes still remains an open problem. Future directions include the extension of these methods to more general signals beyond Ambisonics, and improving the accuracy and translation regions to support realistic free walking in virtual and augmented reproductions of captured sound scenes.

Signal enhancement
Spatial audio signals may be composed of both desired components, such as speech and music, and undesired components, such as noise and interfering sounds. Therefore, in addition to processing aiming at binaural reproduction, signal enhancement may also be required in order to attenuate the interfering components, thereby delivering to the listener high quality spatial audio which is also clean. This problem has been investigated for hearing aids, where the delivery of clean speech is of great importance, while binaural hearing aids also aim to deliver spatial cues to the listener. With this in mind, binaural beamformers that aim to attenuate undesired signal components [140,141] have been developed; these were also extended to include time-frequency masking [229]. However, because spatial information relies on beamforming constraints, it is only partially preserved in the binaural signal. Furthermore, these methods are designed for binaural microphone arrays and may not always be applicable to general arrays.
With the aim of overcoming the limitations of binaural signal enhancement, several studies developed enhancement solutions for Ambisonics signals. In the first approach, directional constraints were introduced into the Ambisonics encoding process to attenuate directional interferences. Then, with the aim of affording more flexibility to target noise fields that are not highly directional single sources, a directional shaping filter that allocates higher directional gain to directions with higher signal-to-noise ratio was introduced. This processing operates directly on the Ambisonics signal [136,230], and while defined in a closed mathematical form, leads to a trade-off between enhancement level and reproduction quality. Later research aimed to provide significant enhancement while perfectly preserving the desired spatial audio signal. For Ambisonics signals, this aim was achieved by first estimating the DOA of the desired source, then estimating the source signal using high-directivity beamforming, and finally estimating the transfer function from the source signal to the Ambisonics signals. This process leads to a reconstruction of the desired Ambisonics signal with the full spatial information, while providing enhancement through the contribution of the beamforming [231]. Recently, this approach was also investigated for a wearable microphone array [143,158]. An alternative approach, also aiming to achieve significant enhancement while preserving spatial information, employed masking in the time-frequency domain, applied directly to the Ambisonics signals or to the same signals spatially transformed by beamforming [232]. While higher noise attenuation was achieved by masking in the transformed spatial domain, masking in the Ambisonics (spherical harmonics) domain better preserved spatial information in the attenuated noise.
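Masking applied directly to the Ambisonics signals can be sketched as follows for a single time-frequency bin. This is an illustration and not the mask design of [232]: it assumes a Wiener-like gain computed from a per-bin signal-to-noise ratio estimate, which is taken as given, and it assumes that the same real-valued gain is applied to all channels. The function names are hypothetical.

```python
def wiener_mask(snr):
    """Wiener-like gain in [0, 1) from a linear-scale SNR estimate
    for one time-frequency bin (the estimate itself is assumed given)."""
    return snr / (1.0 + snr)

def mask_ambisonics_bin(coefficients, snr):
    """Apply one common real-valued gain to every Ambisonics channel of a
    time-frequency bin, leaving inter-channel ratios unchanged."""
    gain = wiener_mask(snr)
    return [gain * c for c in coefficients]
```

Because the gain is common to all channels, the amplitude and phase relations between the Ambisonics coefficients of the bin are unchanged, so the directional information of whatever remains in the bin, including residual noise, is retained.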
Some of these approaches were generalized in a broad framework for signal enhancement [233], which incorporates source signal estimation under various sound field models in a way that preserves both the individual sources and the reverberant signal components, while minimizing the contribution of undesired noise.
While recent methods for the enhancement of spatial audio signals introduced a significant improvement compared to early methods, improved methods that provide superior performance in challenging environments with multiple speakers, reverberation and noise, may be highly desirable when considering realistic scenarios. In addition, many of the methods are designed for Ambisonics signals and spherical arrays, and so enhancement methods for more general array configurations may also be necessary.

Conclusion and outlook
This review paper presented an overview of recent developments in spatial audio signal processing, focusing on recorded sound and binaural reproduction. Significant progress has been made in the past decade, with the proposal of new methods and approaches, making a notable step towards providing high quality audio from recorded sound. Nevertheless, there are clear challenges ahead. These are outlined within the structure of the general framework presented in this paper.
Acoustic scene: real-world scenes may be challenging, with several moving sources, reverberant environments, noise and interference. Most methods developed to date assume stationary sources, and so bridging the gap to handle several, and moving, sources in lively environments could be an important target for future research.
Recording: spatial audio signals are recorded by microphone arrays, and with emerging applications such as smart homes and VR/AR, arrays may be of varying configurations (e.g., on a device), may be composed of only a few microphones, and may be dynamic in space (e.g., wearable arrays). With many of the current methods developed for spherical arrays, an important challenge is to extend emerging methods to work with general arrays and perform well even with moving arrays composed of only a few microphones. Wearable arrays may also introduce challenges with respect to the limited computation resources available, and latency constraints imposed by real-time reproduction and head-tracking, for example. Overcoming these challenges may open great opportunities for delivering affordable spatial audio for consumer devices.
Processing: while signal processing has been the main topic of this paper and is incorporated in the points above as well, a main avenue of research that has been reviewed here is spatial audio signal processing based on learning from measured data. With deep-learning methods continuously developing, their incorporation in the challenging tasks outlined here could be of great benefit. Learning from measured data could also include parametric representation of sound fields based on microphone-array recordings, which have great potential for high performance with compact representations.
Furthermore, emerging approaches for manipulating sound field information for translation and rotation, for example, by non-linear transformation of the directional space (i.e., warping), may lead to new possibilities and increased flexibility for VR/AR and other applications.
Reproduction and perception: reproduction over headphones, and, in particular, using individualized HRTF, will probably be key to high quality spatial audio. The incorporation of individualized HRTF in state-of-the-art algorithms is therefore essential. Furthermore, improved understanding of the relation between the processed audio signal and perception may be essential to ensure that important signal information is maintained or enhanced. Performance evaluations, currently mostly developed for listeners with head-tracking, should be extended to 6DoF motion. Also, mathematically formulated objectives, essential for machine- and deep-learning, that incorporate perceptual attributes, could be useful for developing data-based learning solutions that are perceptually motivated.