SCaLAr – A surrounding spherical cap loudspeaker array for flexible generation and evaluation of virtual acoustic environments

Introduction: Surrounding spherical loudspeaker arrays facilitate the application of various spatial audio reproduction methods and can be used for a broad range of acoustic measurements and perceptual


Introduction
Research on loudspeaker-based spatial audio reproduction methods requires carefully designed surrounding spherical loudspeaker arrays that make spatialisation methods practicable, aiming at plausible or even authentic perception of simulated virtual acoustic environments (VAEs) [1][2][3]. Commonly used methods either rely on binaural technology for loudspeaker-based binaural playback in combination with acoustic crosstalk cancellation (CTC) filters [4][5][6][7], or on panning methods such as higher-order Ambisonics (HOA) [8][9][10][11][12][13], vector base amplitude panning (VBAP) and multiple-direction amplitude panning (MDAP) [13][14][15][16]. Wave field synthesis is a further sound field synthesis approach [17][18][19], although dense linear arrays would be preferred for it. Since an anechoic chamber is not always available or convenient for permanent array installations, conventional rooms, sometimes acoustically optimised, are regarded as acceptable for the targeted application [20][21][22]. Although additional reflections in such rooms can help to conceal known perceptual shortcomings of the respective reproduction method [13,19,23], objective evaluations in particular require that the array is preferably installed in an anechoic room (e.g., [24,25]). An unobtrusive mounting construction further helps to decrease sound field distortions that could lead to biased reproduction error metrics and perceptual test results.
The flexible reproduction possibilities facilitate research in a variety of areas. Typical applications include objective evaluations of spatialisation methods [26] and spherical microphone arrays [27,28], measurements of individual head-related transfer functions (HRTFs), as well as research on hearing aids [29][30][31]. An optimal reproduction environment also allows perceptual metrics to be compared after systematic modifications of system parameters or rendering components. Of particular interest is, for example, how source localisation differs between virtual sound sources (VSSs) based on individualised HRTFs [32,33] and real sound sources, usually represented by discrete loudspeakers [34]. In order to continuously approach ecological validity, objective evaluations of simulation methods for VAEs with room acoustics [35] should be supplemented by perceptual evaluations to substantiate their validity [36,37]. Combined with physiological measures, nonintrusive, controlled experimental environments further represent a tool for assessing environmental noise effects [38]. However, applications are not restricted to the auditory domain but can be extended to multimodal experiments using head-mounted displays for the exploration of auditory perception and cognition [39,40].
A number of international research groups have already designed and implemented surrounding loudspeaker arrays (e.g., [16, 20-22, 24, 25, 30]). As there is often no room for comprehensive implementation details, this report is intended to provide a compact but more accessible template for similar future projects. We present the design of a surrounding spherical cap loudspeaker array, including a description of the reproduction environment, the array mounting construction, the electroacoustic features of the loudspeakers with custom-made cabinets and the signal network implementation. Based on commonly used performance metrics, the sampling layout is analysed for its suitability using various spatial audio reproduction methods, accounting for physical imperfections.

Reproduction environment
The loudspeaker array was mounted in the anechoic chamber of the Institute of Technical Acoustics, RWTH Aachen University, with the dimensions 9.2 m × 6.2 m × 5 m (length × width × height), resulting in a room volume of about 285 m³, see Figure 1. All room surfaces are covered with 0.7 m long glass fibre wedges, determining a lower cutoff frequency of approximately 200 Hz [41]. The wedges and the ceiling construction as well as a net made of steel cables 0.2 m above the floor wedges reduce the effectively usable room dimensions to 7.8 m × 4.9 m × 2.8 m (length × width × height).

Sampling layout
Since the floor net construction in the anechoic chamber cannot be locally detached, loudspeaker positions at zenith angles close to the south pole are unfeasible, limiting the design to spherical cap layouts when the radius should still be adequately large. Primarily for reasons of sampling efficiency and because of the requirement for a ring-based layout, an equal-area sampling [42] with 75 points and a spherical harmonic (SH) order of N = 7 for a nominal array radius of 1.35 m was generated in MATLAB (MathWorks, Natick, Massachusetts, United States) [43]. The removal of physically impractical loudspeakers at the south pole and on the ring above led to a total number of 68 loudspeakers, populating zenith angles up to about 134°. Serving as a visual anchor, one loudspeaker in the horizontal plane at 0° azimuth represents the default viewing direction of the listener, see Figures 1 and 2.
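As a rough plausibility check on the sampling density, a full-sphere SH expansion of order N involves (N + 1)² basis functions, so at least that many sampling points are usually needed. The following Python sketch applies this counting argument to the point counts given above. The function names are hypothetical, and the check is only a necessary condition: it says nothing about the conditioning of a spherical cap layout.

```python
import math


def sh_channel_count(order: int) -> int:
    """Number of spherical harmonics up to and including the given order."""
    return (order + 1) ** 2


def max_full_sphere_order(num_points: int) -> int:
    """Largest order N with (N + 1)^2 <= num_points (necessary condition only)."""
    return math.isqrt(num_points) - 1


required_points = sh_channel_count(7)        # basis functions for N = 7
order_initial = max_full_sphere_order(75)    # initial equal-area sampling
order_final = max_full_sphere_order(68)      # after removing impractical positions
```

Under this counting argument, both the initial 75-point sampling and the final 68-loudspeaker layout leave a small margin over the 64 functions of order N = 7.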

Motion tracking
Binaural reproduction over loudspeakers in particular requires that the auralisation system knows the listener's current real-world position and orientation, relative to the real-world loudspeaker positions [44,45]. For behavioural evaluations, the motion tracking data collected from participants when solving experimental tasks may provide crucial information about possible applied listening strategies, thus enabling complementary data analysis [46,47]. We therefore integrated an optical motion tracking system (OptiTrack, NaturalPoint Inc., Corvallis, Oregon, USA), consisting of four infrared cameras (Flex 13), that transmits the data stream via a tracking hub (OptiHub 2) to a desktop computer (Intel® Core i7-8700). Synchronised data logging is possible using the dedicated software (Motive) together with the NatNet software development kit and MATLAB [43]. With an imager resolution of 1280 × 1024 pixels (1.3 MP resolution, 56° × 46° field of view), a maximal native frame rate of 120 Hz and a specified latency of 8.3 ms, the tracking system can resolve objects up to 9 m away at a 3-dimensional accuracy of ±0.2 mm [48].

Mounting construction
In order to keep the overall weight and the influence on the reproduced sound field low, the frame construction consists of carbon fibre and aluminium poles with 10 mm diameter, see Figure 1. The loudspeakers in the upper hemisphere were suspended at their centre of gravity on brackets that allow height adjustment, tilting and rotation, and on vertical poles, which in turn were attached to horizontal ceiling poles. To enable correct loudspeaker positioning, these movable horizontal ceiling poles were arranged as spokes between the two concentric ceiling rings with radii of 1.5 m and 0.5 m. For the loudspeakers in the lower hemisphere, including the horizontal plane, we constructed three pentadecagonal rings with decreasing incircle radii of 1.44 m, 1.33 m and 1.07 m that were vertically connected by relocatable aluminium poles. This lower hemispherical ring construction was suspended by 15 vertical aluminium poles from the outer ceiling ring and additionally stabilised with rods to avoid lateral movements. Attaching the loudspeakers to the horizontal carbon rods with sliding and rotating clamping elements allowed the radius and angular orientation to be individually adjusted in the area of the respective centre of gravity.
The door element to enter the array, holding two loudspeakers, demanded careful design as it directly influences the internal stability of the whole mounting construction when opened and general practical usability. A hinge construction, defined end positions of the pole terminations, and additional braces ensure minimal overall mechanical impact and reproducible loudspeaker positions after each use.

Actual loudspeaker positions
The positioning and alignment of the loudspeakers were mainly carried out using laser measurement tools. In a first step, the centre of the north pole loudspeaker was projected onto the ground with a self-levelling cross-line laser (GLL 3-80, Bosch Professional, Gerlingen-Schillerhöhe, Germany). The cross-line laser was then rotatably attached to a compass rose in this ground centre to read the azimuth angles and align the respective loudspeakers accordingly. Additionally, a rod was mounted on the north pole loudspeaker with a bayonet lock, at the end of which a swivelling distance laser (ADM30, FLEX-Elektrowerkzeuge GmbH, Steinheim/Murr, Germany) with a vertical compass rose made it possible to measure the respective elevation angles and distances of the loudspeakers. We used an electronic spirit level and the cross-line laser to verify the inclinations and heights of the individual loudspeakers per ring, respectively.
For a correct configuration of the auralisation software (e.g., [46,49,50]) and to simulate the sound field synthesis performance given loudspeaker positioning errors, we measured the actual loudspeaker positions with the optical tracking system. For the definition of the tracking system's real-world coordinate system, the centre of the array was determined using a light aluminium replica of the original calibration square (CS-100, OptiTrack, NaturalPoint Inc., Corvallis, Oregon, USA). This calibration square was attached to a ball-and-socket joint on the vertical rod of the north pole loudspeaker and aligned in the horizontal plane and with respect to the centre of the floor using the cross line laser. Since reliable and accurate tracking in outside-in systems is limited to a certain detection volume, which develops around the centre of the array depending on camera positions and orientations, a tetrahedral rigid body consisting of four reflective markers was mounted on a 70 cm long carbon fibre rod. The tip of this rod was used to measure the loudspeaker membrane centres by translating the rigid body's pivot point accordingly.
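The pivot point translation used for the rod measurements amounts to a rigid-body transform: the tip position equals the tracked body position plus the body-frame tip offset rotated into world coordinates. The sketch below illustrates this relation in Python with NumPy. The function name, the example pose and the 0.7 m offset along the body x-axis are assumptions for illustration and are not taken from the tracking software.

```python
import numpy as np


def probe_tip_position(body_position, body_rotation, tip_offset):
    """Return the world-frame tip position p_tip = p_body + R_body @ offset."""
    p = np.asarray(body_position, dtype=float)
    rotation = np.asarray(body_rotation, dtype=float)  # 3x3 body-to-world matrix
    offset = np.asarray(tip_offset, dtype=float)       # tip in body coordinates
    return p + rotation @ offset


# Hypothetical pose: body 1 m above the origin, rotated 90 degrees about z.
rotation_z90 = np.array([[0.0, -1.0, 0.0],
                         [1.0, 0.0, 0.0],
                         [0.0, 0.0, 1.0]])
tip = probe_tip_position([0.0, 0.0, 1.0], rotation_z90, [0.7, 0.0, 0.0])
```

With this pose, the tip ends up 0.7 m along the world y-axis at the height of the rigid body.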

Participant chair
A 600 mm × 600 mm × 30 mm (length × width × height) aluminium grillage, which is supported on the solid ground below the floor wedges by four threaded rods and braced with the floor netting by diagonally and laterally crossed steel cables, allows a socket for the participant chair (IS 1926, Dauphin HumanDesign Group GmbH & Co. KG, Offenhausen, Germany) to be attached. Optionally, a table covered with absorbers can be mounted to accommodate wireless user input devices (e.g., mouse, keyboard, tablet). To align the participant's interaural axis with respect to the loudspeaker array's centre, the chair can be moved back and forth and adjusted in height. Depending on the experimental design (e.g., a virtual all-around search task), a rotation around the listener's longitudinal axis is made possible or prevented by an adjustment screw. If necessary, the participant's head movements can be restricted using an unobtrusive height-adjustable head rest.

Signal network
To transmit the audio and control signals, a network solution was chosen, interlinking the majority of hardware devices via 16-port Gigabit Ethernet switches (DGS-1016D, Netgear, San Jose, California, USA) and cables. The use of a microphone preamp, A/D converter and ADAT connection protocol (RME Octamic XTC, Audio AG, Haimhausen, Germany) in combination with a digital audio format converter (REDNET 3, Focusrite, High Wycombe, UK) makes it possible to capture up to 32 analogue signals for various measurement applications (e.g., measurement microphones, artificial heads, microphone arrays).
Audio playback relies on 17 4-channel power amplifiers with passive cooling, on-board digital signal processor, low-noise Pascal amplifier modules (112 dB(A) output signal-to-noise ratio) and Dante® interface (PPA 1000-4-PC DSP, Four Audio, Herzogenrath, Germany). The amplifier modules were built into custom-made aluminium racks with air vents that hold up to four devices and one network switch. The rack-mounted network switches are connected in series while we used a star-shaped concept for the connection of the amplifiers per rack. Individual device and channel configurations are accomplished via an independent control network with a dedicated PCIe network card. A 4-channel Dante® headphone amplifier (KLANG:QUELLE, KLANG:technologies GmbH, Aachen, Germany) enables playback over open dynamic headphones (HD 650, Sennheiser, Wedemark, Germany). For safety reasons, we installed two kill switches within reach of the experimenter and participant, which immediately interrupt the power supply to all power amplifiers.

Supervision
Visual supervision of and verbal communication with participants taking part in perceptual experiments are possible via a network camera with night vision (RLC-410, Reolink, Hong Kong, China) and an independent custom-made talkback system, respectively. A pair of control headphones can be used for channel-based verification of playback signals and to listen to the same material presented to the participant or a binaural downmix [51].

Electroacoustics
All measurements in the remaining parts of this section, except those related to the in-situ finite impulse response (FIR) filter verification measurements (Sect. 2.6.2), the background noise level (BNL) (Sect. 2.6.4), and CTC performance evaluation (Sect. 2.7.4), were carried out in the hemi-anechoic chamber of the Institute of Technical Acoustics, RWTH Aachen University, with dimensions 11 m × 5.97 m × 4.5 m (length × width × height), exhibiting a lower frequency limit of about 100 Hz [41]. We used the same signal network as described in Section 2.5. Further measurement details are provided in the corresponding subsections.

Loudspeaker cabinet and crossover design
To favour confined spatial sampling and minimise unwanted phase effects of low- and high-frequency drivers located at different positions, we used 4-inch coaxial 2-way loudspeakers (Seas L12RE/XFC H1602-04, Moss, Norway), which were built into 4 L octagonal prismatic cabinets (sidewalls 6 mm, front and backside 12 mm thick birch multiplex), representing a bass reflex system with a cylindrical rear panel port of 100 mm length and 35 mm diameter, see Figure 4. Both cabinet and crossover design were developed based on the acoustic measurement results described below using simulation software [52].
To measure the individual drivers' impulse responses, an exponential sweep with a length of 2^17 samples at a sampling frequency of 44.1 kHz between 20 Hz and 20 kHz was generated in MATLAB [43] and played back over an example loudspeaker. Positioned on the floor at a distance of 1.4 m, the loudspeaker was tilted towards the measurement microphone (Brüel & Kjaer Type 4189 and 2669, Naerum, Denmark), which was used together with a conditioning amplifier (Brüel & Kjaer Type 2610, Naerum, Denmark). Finally, the resulting impulse responses were windowed in the time domain (Hann window, 1.5 ms fade-in after 0.5 ms, 5 ms fade-out after 25 ms).
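For illustration, an excitation signal with the stated parameters can be sketched in Python with NumPy using the common closed-form phase of an exponential sweep. This is an assumption about the signal shape only: the MATLAB routine actually used [43] may differ in details such as fades, amplitude or phase conventions, and the function name is hypothetical.

```python
import numpy as np


def exponential_sweep(n_samples, fs, f_start, f_stop):
    """Exponential sine sweep whose frequency rises from f_start towards f_stop."""
    t = np.arange(n_samples) / fs
    duration = n_samples / fs
    rate = np.log(f_stop / f_start)
    # Phase is the integral of the instantaneous frequency f_start * exp(t * rate / duration).
    phase = 2.0 * np.pi * f_start * duration / rate * (np.exp(t * rate / duration) - 1.0)
    return np.sin(phase)


FS = 44100
sweep = exponential_sweep(2 ** 17, FS, 20.0, 20000.0)
duration_s = len(sweep) / FS  # sweep length in seconds
```

At 44.1 kHz, 2^17 samples correspond to a sweep duration of just under 3 s.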

Spectral equalisation
For the design of the FIR filters, we repeated the measurements described in Section 2.6.1 for all loudspeakers. Individual 512-tap FIR filters with highpass filter (4th-order Butterworth, 80-Hz cut-off frequency) and 1/12-octave band smoothing were created in the software filter design module (System Designer, Four Audio, Herzogenrath, Germany) and uploaded to the corresponding digital signal processor of the power amplifiers. Adding the power amplifiers' system latency of 2.54 ms, the filter target latency of 8 ms and a maximum 1-ms Dante® network latency results in an overall output latency of 11.54 ms, without taking into account the acoustic delay due to the loudspeaker distance and additional delays for radial corrections, cf. Section 3.1.
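The latency budget can be reproduced with a few lines of Python. The three electronic contributions are the values stated above. The acoustic propagation delay is added here only for orientation, assuming the nominal array radius of 1.35 m and a speed of sound of 343 m/s, and the constant and function names are hypothetical.

```python
AMPLIFIER_LATENCY_MS = 2.54   # power amplifier system latency
FIR_TARGET_LATENCY_MS = 8.0   # FIR filter target latency
DANTE_MAX_LATENCY_MS = 1.0    # maximum network latency


def output_latency_ms():
    """Sum of the electronic latency contributions in milliseconds."""
    return AMPLIFIER_LATENCY_MS + FIR_TARGET_LATENCY_MS + DANTE_MAX_LATENCY_MS


def acoustic_delay_ms(distance_m, speed_of_sound=343.0):
    """Propagation delay from a loudspeaker to the array centre in milliseconds."""
    return 1000.0 * distance_m / speed_of_sound


electronic_ms = output_latency_ms()
acoustic_ms = acoustic_delay_ms(1.35)  # assumed nominal radius
```

Under these assumptions, propagation adds roughly 4 ms to the 11.54 ms of electronic latency.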
To investigate the acoustic influence of the physical setup, i.e., the loudspeakers and mounting construction, the measurements described in Section 2.6.1 were repeated for an example loudspeaker when installed as part of the array with active FIR filter. The same time window settings were applied to ensure comparability.

Directivity pattern
For this measurement, the example loudspeaker was mounted on a 1 m long pole, which was connected to a custom-made turntable. The turntable was tilted so that the loudspeaker's on-axis direction pointed towards the microphone, positioned on the floor at a distance of 5.4 m. This distance was chosen in order to keep the required inclination of the turntable in a reasonable range and to minimise unfavourable reflections due to the measurement setup as far as possible. A remote-controlled stepping motor (PD4-N, Nanotec, Feldkirchen/Munich, Germany) rotated the turntable in steps of 1° to obtain the loudspeaker's directivity pattern in the horizontal plane, assuming a rotationally symmetric behaviour. The corresponding impulse responses were spectrally smoothed using 1/6-octave band filters, time-windowed (Hann window, 0.5 ms fade-in after 0.9 ms, 5 ms fade-out after 35 ms) and cropped [43].
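A time window of this kind can be sketched in Python with NumPy as the product of a rising and a falling half Hann flank. The sketch assumes that "fade-in after 0.9 ms" means the fade-in starts at 0.9 ms and that "fade-out after 35 ms" means the fade-out starts at 35 ms. This reading, the function name and the 2048-sample length are assumptions for illustration.

```python
import numpy as np


def fade_window(n_samples, fs, fade_in_start, fade_in_len, fade_out_start, fade_out_len):
    """Window that is 0 before the fade-in, 1 on the plateau and 0 after the fade-out.

    All times are given in seconds; both flanks are half Hann windows.
    """
    t = np.arange(n_samples) / fs
    rise = np.clip((t - fade_in_start) / fade_in_len, 0.0, 1.0)
    fall = np.clip((t - fade_out_start) / fade_out_len, 0.0, 1.0)
    fade_in = 0.5 * (1.0 - np.cos(np.pi * rise))
    fade_out = 0.5 * (1.0 + np.cos(np.pi * fall))
    return fade_in * fade_out


FS = 44100
window = fade_window(2048, FS, 0.9e-3, 0.5e-3, 35e-3, 5e-3)
plateau_value = window[int(0.010 * FS)]  # sample at 10 ms, inside the plateau
```

Multiplying an impulse response by this window suppresses content before the direct sound and reflections arriving after the fade-out.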

Background noise level
A low-noise measurement microphone (40HL, GRAS Sound & Vibration A/S, Holte, Denmark) was placed in the array centre and used in combination with a conditioning amplifier (Brüel & Kjaer Type 2690-A, Naerum, Denmark) and an otherwise unchanged hardware setup to measure the BNL when all loudspeakers were connected to the power amplifiers. As a baseline, we first measured the BNL with power amplifiers switched off. In a second measurement, all power amplifiers were switched on to investigate the combined noise contribution of the power amplifiers and the loudspeakers during standby operation. Both measurements were performed for a duration of 60 s.

Performance metrics
Previous work introduced the sum of squared loudspeaker gains, Ê, as an energy measure to estimate direction-dependent loudness of panning-based spatial audio reproduction methods [53]. The magnitudes of the velocity [54] and energy vectors [53], r_V and r_E, respectively, are considered as additional quality indicators for the angular mapping performance [12,55]. There is perceptual evidence that these metrics correlate with detecting the direction of plane wave incidence at low frequencies (below 700 Hz), while enabling binaural evaluation of the maximal spatial energy concentration for source localisation at higher frequencies and perceived VSS width in the intended sound field synthesis area [13,23,53,54].
The selected metrics were considered to provide an overall performance evaluation of the various decoder strategies described below. Each decoder was calculated for the ideal and actual sampling layout. The resulting decoder-dependent loudspeaker gains were used to estimate Ê, r_V and r_E for directions of incidence on an equiangular grid with a resolution of 1° × 1° in azimuth and elevation, covering the entire sphere [56]. The choice of these directions of incidence should also indicate problems in decoders that are not designed for reproduction outside the spatial range covered by the loudspeaker array. To further assess the perceptual consequences of reproduction errors, the directional deviations between the respective vector direction and the target grid direction, i.e., the directional velocity and energy errors, as well as the perceived source width α [23] were evaluated [56]. In addition to across-decoder comparisons, this strategy makes it possible to investigate the influence of actual loudspeaker positioning errors on the simulated performance of the selected panning-based reproduction techniques and to analyse their general suitability.
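The metrics can be illustrated with a minimal Python sketch using NumPy. It assumes the common Gerzon-type definitions: the energy as the sum of squared gains, the velocity vector as the gain-weighted mean of the loudspeaker unit vectors, and the energy vector as the corresponding mean weighted by squared gains. The exact normalisation used in [53,54,56] may differ, and all function and variable names are hypothetical.

```python
import numpy as np


def panning_metrics(gains, directions):
    """Return (energy, velocity vector, energy vector) for one target direction.

    gains: (L,) real loudspeaker gains; directions: (L, 3) unit vectors.
    """
    g = np.asarray(gains, dtype=float)
    u = np.asarray(directions, dtype=float)
    energy = np.sum(g ** 2)
    r_v = (g @ u) / np.sum(g)
    r_e = ((g ** 2) @ u) / energy
    return energy, r_v, r_e


def angular_error_deg(vector, target):
    """Angle in degrees between a metric vector and the target direction."""
    cos_angle = np.dot(vector, target) / (np.linalg.norm(vector) * np.linalg.norm(target))
    return np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))


# Hypothetical example: two loudspeakers at +/-30 degrees azimuth, equal gains.
azimuth = np.radians([30.0, -30.0])
speakers = np.stack([np.cos(azimuth), np.sin(azimuth), np.zeros(2)], axis=1)
energy, r_v, r_e = panning_metrics([2 ** -0.5, 2 ** -0.5], speakers)
error_v = angular_error_deg(r_v, np.array([1.0, 0.0, 0.0]))
```

In this symmetric example, both vectors point to the front with a magnitude of cos(30°), i.e., the direction is mapped correctly while the vector magnitudes fall below one.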

Vector base amplitude panning
Since VBAP relies on convex hull triangulation, the available sampling layout provides an unproblematic basis for the selection of valid loudspeakers, pairs or triplets, as long as VSSs are synthesised from directions lying on the spherical cap. For directions towards the south pole, where this approach is likely to suffer from numerical instabilities and unsatisfactory perceptual results, the insertion of one or more imaginary loudspeakers [57] is useful, as it allows unfeasible VSS directions to be properly discarded or downmixed to nearby loudspeakers [13]. To control for constant source width and reduced colouration, particularly in the case of moving VSSs, the use of auxiliary spreading sources, as done for MDAP, was recommended in previous work [15,23].
As baseline panning variants, we evaluated the performance of both the VBAP and MDAP decoders. For VBAP, triangulations in the lowest zenith angle range were permitted, since the maximum recommended by-triangle loudspeaker opening angle of 90° is maintained in the given sampling layout [16]. As an example MDAP configuration, we used 12 spreading sources, i.e., virtual auxiliary sources that are equally distributed on a concentric ring around the corresponding target direction of the VSS. This concentric ring was defined at a spreading angle of 10°, i.e., the angle relative to the target direction of the VSS [15]. The final loudspeaker gains based on the entire set of sources per target direction were calculated using the proposed energy-based variant of VBAP, i.e., vector base intensity panning [15,56,58].
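For a single loudspeaker triplet, the classical amplitude-based VBAP gains follow from solving a 3 × 3 linear system, as sketched below in Python with NumPy. This covers only the basic triplet step: it does not include the triangulation, the MDAP spreading sources or the energy-based variant used for the final gains. The function name and the orthogonal example triplet are assumptions for illustration.

```python
import numpy as np


def vbap_gains(triplet_directions, target_direction):
    """Energy-normalised VBAP gains for one triplet, or None if the target is outside."""
    base = np.asarray(triplet_directions, dtype=float).T  # columns: loudspeaker unit vectors
    target = np.asarray(target_direction, dtype=float)
    gains = np.linalg.solve(base, target)  # target = base @ gains
    if np.any(gains < -1e-9):
        return None  # a negative gain means the target lies outside this triangle
    gains = np.clip(gains, 0.0, None)
    return gains / np.linalg.norm(gains)


# Hypothetical triplet with 90-degree opening angles along the coordinate axes.
triplet = np.eye(3)
centre_gains = vbap_gains(triplet, np.ones(3) / np.sqrt(3.0))
outside = vbap_gains(triplet, [-1.0, 0.0, 0.0])
```

A target in the centre of the triangle yields three equal gains, whereas a target outside the triangle is rejected so that another triplet can be tested.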

Higher-order Ambisonics
In contrast to vector-based panning methods with discrete loudspeaker weights, HOA allows the sound field to be excited using superposed orthogonal basis functions, represented by SHs, enabling the use of a continuous virtual panning function for spatial reproduction of previously encoded source signals [10,13]. The limitation of the maximum SH order entails an upper frequency limit for periphonic playback, above which spatial aliasing occurs [59]. With the available number of loudspeakers, determining a maximum and full-sphere equivalent SH order of N_eq = 7 [57], and an assumed valid reproduction in a sweet sphere of 15 cm diameter, an upper frequency limit of approximately 5.13 kHz can be estimated [60]. Partial spherical coverage and small array radii pose additional challenges regarding the ambisonic decoder design [57,61]. For general comparison purposes (see also, [55,58]), the decoder strategies given below were calculated [56]: sampling ambisonic decoder (SAD) [62], mode-matching ambisonic decoder (MMAD) [63] with improved regularisation [55], energy-preserving ambisonic decoder (EPAD) [64], improved all-round ambisonic decoder (AllRAD+) [55,57].
We applied a virtual t-design of the order 2N + 1 = 15 for implementing the AllRAD+ decoder [57,65].
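The order of magnitude of the upper frequency limit can be checked with the widely used rule of thumb N ≈ kr, i.e., f ≈ Nc/(2πr), where r is the radius of the sweet sphere. The Python sketch below applies this rule with an assumed speed of sound of 343 m/s. The estimate in [60] may rest on a slightly different formula or speed of sound, so the result is only expected to lie close to the value stated above, and the function name is hypothetical.

```python
import math


def hoa_upper_frequency_hz(order, radius_m, speed_of_sound=343.0):
    """Rule-of-thumb frequency limit from N = k * r, with k = 2 * pi * f / c."""
    return order * speed_of_sound / (2.0 * math.pi * radius_m)


# SH order 7 and a sweet sphere of 15 cm diameter, i.e., 7.5 cm radius.
f_limit_hz = hoa_upper_frequency_hz(7, 0.075)
```

Under these assumptions, the rule of thumb yields about 5.1 kHz.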

Acoustic crosstalk cancellation
The original transaural stereo approach required a set of two loudspeakers to convey a binaural audio signal by applying static acoustic CTC filters [4,5]. The simultaneous use of multiple loudspeakers is possible via an L-CTC approach, with L representing the number of involved loudspeakers. Global minimum-phase regularisation allows anti-causal artefacts to be removed and stable CTC filters to be generated [7,66]. In combination with the optical tracking system, cf. Section 2.3, rotational and translational sweet spot extension enables real-time reproduction of dynamic VAEs [50] with optional plausible room acoustic simulations [35]. Since HRTFs are typically referenced to the centre of the interaural axis with absent head [2] and any change in the listener's head position and orientation should also be referenced to this centre, the pivot point of the head-mounted rigid tracking body must be corrected by applying an individual translational offset [43]. However, it should be noted that individual anthropometric differences to generic HRTFs [67], which are likely to reduce the localisation performance [68], are not considered by this correction and require the use of individualised [32,33] or the measurement of individual datasets (e.g., [69][70][71][72]).
With sufficient processing capacity, it is theoretically possible to use all 68 array loudspeakers for binaural playback. However, we were interested in finding suitable loudspeaker arrangements and estimating the minimum number of loudspeakers required for sufficient channel separation [73]. Therefore, we calculated the optimal and achievable channel separation for a variety of different loudspeaker subset layouts with an increasing number of loudspeakers, cf. Figure 5, and the full layout. The first two simulated arrangements followed the recommendation for the use of elevated loudspeakers [74], with all arrangements aiming for a reasonably uniform spatial distribution [45].
To estimate the optimal and achievable channel separation using the presented experimental setup, the binaural impulse responses of each loudspeaker were measured sequentially from an artificial head with detailed ear and simplified torso geometry [75] in the array centre. As excitation signal, an exponential sweep with a length of 2^16 samples was used with otherwise unchanged measurement hardware and software settings, as described in Sections 2.5 and 2.6, and activated FIR filters. The raw binaural impulse responses were windowed (Hann window, 1 ms fade-in after 1 ms, 1 ms fade-out after 8 ms). To account for the transducer characteristics of the loudspeakers and artificial head microphones (MK 2H, Schoeps GmbH, Karlsruhe, Germany) and preamp (CMC 6, Schoeps GmbH, Karlsruhe, Germany), the on-axis measurements of both devices were inverted, implemented as minimum-phase filters and convolved with the windowed binaural impulse responses, resulting in the final playback HRTFs, which were cropped to a length of 256 samples. Based on these HRTFs, we calculated CTC filters using an L-CTC approach and a regularisation factor of 1e-3 [66] for the corresponding layouts shown in Figure 5 and the full layout. Since we were interested in seeing how the physical setup affects the channel separation, the CTC filters were applied to the playback HRTFs and the equalised but unwindowed spatial transfer functions, the latter exhibiting a cropped length of 11,025 samples. For these two scenarios, the optimal and achievable channel separation was calculated for the left ear only [7] due to comparable across-ear performance.
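The principle of regularised CTC filter design and the resulting channel separation can be sketched in Python with NumPy. The sketch uses plain Tikhonov regularisation per frequency bin, C = Hᴴ(HHᴴ + βI)⁻¹, which is a simplification and not the global minimum-phase regularisation of [7,66]. The frequency-independent 2 × 2 transfer matrix is a made-up toy example, and all names are hypothetical.

```python
import numpy as np


def ctc_filters(H, beta=1e-3):
    """Regularised CTC filters per frequency bin.

    H: (bins, 2, L) transfer functions from L loudspeakers to both ears.
    Returns C with shape (bins, L, 2) so that H @ C approximates the identity.
    """
    H_herm = np.conj(np.swapaxes(H, -1, -2))
    gram = H @ H_herm + beta * np.eye(2)
    return H_herm @ np.linalg.inv(gram)


def channel_separation_db(H, C):
    """Left-ear channel separation per bin: wanted path relative to crosstalk path."""
    G = H @ C  # (bins, 2, 2) ear signals per binaural input channel
    return 20.0 * np.log10(np.abs(G[:, 0, 0]) / np.abs(G[:, 0, 1]))


# Toy example: two loudspeakers, crosstalk path 6 dB below the direct path.
H_bin = np.array([[1.0, 0.5], [0.5, 1.0]], dtype=complex)
H = np.tile(H_bin, (4, 1, 1))  # four identical frequency bins
C = ctc_filters(H, beta=1e-3)
separation_db = channel_separation_db(H, C)
```

In this toy case, the regularisation alone limits the separation to about 55 dB. Applying filters designed on one set of transfer functions to another set, as done for the achievable channel separation, reduces it further.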

Physical implementation
The ideal and actual loudspeaker positions are plotted in Figure 2. Figure 6 shows the resulting deviations, split into radial (R), azimuth (φ), zenith (θ) and great circle central angle (γ) error components. Note that the radial error was calculated after applying digital by-channel delays in steps of two samples on the digital signal processors of the power amplifiers. This allowed discrete corrections of 16 mm at a sampling rate of 44.1 kHz and a speed of sound of 343 m/s to obtain a virtual array radius of 1.44 m, with channel gains adapted accordingly. If more precise radial corrections are required, the individual loudspeaker signals can be delayed exactly using, for example, fractional delays [76,77] or by setting the FIR filter's group delay.
Figure 5. Loudspeaker subset layouts (a), (b) and (c) to investigate the CTC performance in terms of optimal and achievable channel separation. The horizontal and vertical arrows represent the listener's default orientation by the according view and up vectors.
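The size of the discrete radial correction follows directly from the sampling rate and the speed of sound, as the Python sketch below shows. The measured radius of 1.35 m in the example is a hypothetical value chosen for illustration, and the function name is an assumption.

```python
SPEED_OF_SOUND = 343.0  # m/s
FS = 44100              # Hz


def delay_samples_for_radius(actual_radius_m, target_radius_m, step=2):
    """Delay, as a multiple of `step` samples, that moves a loudspeaker out to the target radius."""
    extra_distance = target_radius_m - actual_radius_m
    if extra_distance < 0.0:
        raise ValueError("a delay can only increase the virtual radius")
    raw_samples = extra_distance / SPEED_OF_SOUND * FS
    return step * round(raw_samples / step)


step_mm = 2 / FS * SPEED_OF_SOUND * 1000.0         # distance per two-sample step
delay = delay_samples_for_radius(1.35, 1.44)       # hypothetical measured radius
residual_mm = abs(1.35 + delay / FS * SPEED_OF_SOUND - 1.44) * 1000.0
```

One two-sample step corresponds to about 15.6 mm, which matches the stated 16 mm after rounding, and the remaining radial error in this example is a few millimetres.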

Loudspeaker cabinet and crossover design
The measured on-axis sound pressure transfer functions of the two drivers in the final cabinet are shown in Figure 7a. Based on these results, an optimisation routine adapted the electronic components of the crossover circuit in a way that the combination of loudspeaker response and filter matches the target responses of a 4th-order Linkwitz-Riley crossover, see Figure 7b. A crossover frequency of 2 kHz was chosen to take account of the limited power handling capacity of the dome tweeter and the very small volume behind it, both of which advise against operation at lower frequencies. To implement the optimised crossover, depicted in Figure 7c, capacitors with high-quality foils (PE, 100 V) and coils with air core (0.1 mH) and ferrite core (1.2 mH) were used. The electronic components were mounted on a round circuit board and screwed to the plastic holder on the back of the magnet. Banana plugs facilitate the connection via 0.75 mm² cables. Figure 7b displays the resulting transfer functions with adopted energy distribution between the drivers. The roll-off reaches the desired target of 24 dB/octave. However, in the stop band region, the similar magnitude patterns of the two drivers in the direct vicinity of the crossover frequency lead to some remaining deviations that cannot be addressed by the crossover network.
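The ideal target responses of a 4th-order Linkwitz-Riley crossover, each formed by two cascaded 2nd-order Butterworth sections, can be evaluated with a short Python sketch using NumPy. It covers the analogue target functions only and does not model the passive network or the measured driver responses. The function names are hypothetical.

```python
import numpy as np


def lr4_responses(frequency_hz, crossover_hz):
    """Complex lowpass and highpass targets of a 4th-order Linkwitz-Riley crossover."""
    s = 1j * np.asarray(frequency_hz, dtype=float) / crossover_hz  # normalised frequency
    butterworth2 = s ** 2 + np.sqrt(2.0) * s + 1.0
    lowpass = 1.0 / butterworth2 ** 2    # two cascaded 2nd-order Butterworth lowpasses
    highpass = s ** 4 / butterworth2 ** 2
    return lowpass, highpass


def to_db(response):
    """Magnitude in dB."""
    return 20.0 * np.log10(np.abs(response))


freqs = np.array([2000.0, 16000.0, 32000.0])
lp, hp = lr4_responses(freqs, 2000.0)
level_at_crossover_db = to_db(lp[0])                  # both branches are equal here
slope_db_per_octave = to_db(lp[1]) - to_db(lp[2])     # asymptotic lowpass roll-off
summed_magnitude = np.abs(lp + hp)                    # flat summed response
```

Both branches are about 6 dB down at the crossover frequency, the roll-off approaches 24 dB/octave, and the summed magnitude stays at unity, which is why this alignment is a common crossover target.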

Spectral equalisation
The top panel of Figure 8a displays the results of the on-axis loudspeaker frequency response measurements without FIR filters. The loudspeakers exhibit a mean sensitivity of 69 ± 2 dBV/m (μ ± σ, range: 61-74 dBV/m) between 60 Hz and 20 kHz and a distinct energy increase around the port resonance frequency of about 92 Hz. Further deviations from a linear frequency response can be observed at the crossover frequency of 2 kHz, for reasons mentioned in Section 3.2.1, and around 11 kHz, owing to the coaxial driver geometry and associated phase irregularities. With active individual FIR filters, the batch-to-batch variations could be minimised, resulting in a passband sensitivity (−3 dB re mean sensitivity) of 71 ± 1 dBV/m (μ ± σ, range: 68-72 dBV/m) between 92 Hz and 20 kHz. The transfer function issues related to crossover design and driver geometry effects were largely removed. What remains after FIR filtering are level fluctuations within ±3 dB between 2.4 kHz and 3.7 kHz. This effect can be explained by component-related variations and the high sound pressure behind the textile dome, which has a reduced stability in this frequency range. Figure 8b shows the on-axis sound pressure level transfer function of an example loudspeaker (no. 8) when mounted in the array. Notable deviations from the response measured under optimal conditions start above frequencies of about 300 Hz. Once the wavelength is within the range of the loudspeaker dimensions and smaller, cf. Figure 4, the influence of other array loudspeakers becomes prominent, resulting in distinct peaks and notches, for example, at about 1 kHz and 1.3 kHz due to interference effects. The pattern at 11 kHz is still visible although smeared. Note that the apparent energy drop above 15.6 kHz is most likely related to the directivity pattern of the microphone, which was pointing to the north pole loudspeaker during the measurement.
Figure 9 shows the example loudspeaker's directivity, normalised to its on-axis transfer function. All values were mapped to discrete levels of 3 dB and truncated if they were outside the displayed dynamic range. The example array loudspeaker shows a symmetric homogeneous broadband horizontal directivity pattern (±3 dB) within ±10° and exhibits deviations up to +6 dB between frequencies of 10.7 kHz and 11.7 kHz within ±20°. The previously described on-axis dip around 11 kHz can be explained by ring-shaped diffraction at edges around the dome tweeter. This effect leads to destructive on-axis and constructive off-axis interference, lowering and increasing the sound pressure level, respectively, and is clearly reflected by the two white maxima at 10°-35° off axis. However, a complete correction of this dip is not recommended as the radiated sound power in the affected frequency range would increase too much, resulting in an unpleasant and excessive high-frequency emphasis. Instead, it is advisable to accept this irregularity, as it will hardly be audible using broadband signals.
Figure 10 displays the BNL as unweighted equivalent continuous sound pressure levels L_Z,eq in octave bands with power amplifiers switched on and off, measured in the centre of the loudspeaker array. An increase of inherent noise levels when activating the power amplifiers can be observed in particular for octave bands with centre frequencies above 500 Hz, overall still falling below the NR15 curve, while exceeding the NR10 curve in octave bands above 2 kHz [78]. The total sound pressure level changes from about 26 dB to 28 dB.
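To put the change in total level into perspective, the following Python sketch assumes that the noise added by the activated power amplifiers is incoherent with the baseline noise, so that levels add energetically. Under this assumption, which is not a measurement result, the level of the added noise alone can be estimated from the two rounded totals. The function names are hypothetical.

```python
import math


def energetic_sum_db(levels_db):
    """Total level of incoherent contributions given in dB."""
    return 10.0 * math.log10(sum(10.0 ** (level / 10.0) for level in levels_db))


def energetic_difference_db(total_db, baseline_db):
    """Level of the contribution that raises baseline_db to total_db."""
    return 10.0 * math.log10(10.0 ** (total_db / 10.0) - 10.0 ** (baseline_db / 10.0))


added_noise_db = energetic_difference_db(28.0, 26.0)
check_total_db = energetic_sum_db([26.0, added_noise_db])
```

With the rounded totals of 26 dB and 28 dB, the added contribution is estimated at just under 24 dB, i.e., still below the baseline level itself.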

Spatial audio reproduction
The results of the performance metrics using panning-based spatial audio reproduction methods are shown in Figure 11 for VBAP, MDAP and the selected HOA decoder strategies (N = 7), which were calculated based on the ideal and actual loudspeaker layouts. A statistical analysis was not carried out, as even the smallest differences in means/medians would lead to significant but perceptually negligible results due to the large sample size and the resulting power of the applied tests.

Vector base amplitude panning
As expected due to the normalisation of loudspeaker weights, both VBAP and MDAP decoders exhibit direction-independent overall energy. Not surprisingly, the vector magnitudes r_V and r_E show very high values when using VBAP, since the decoder relies on a loudspeaker subset selection and thus shifts the velocity as well as the energy vectors in the desired VSS direction. Due to the underlying principle, the MDAP decoder results in a higher variance and generally decreased values in r_V. The same tendency can be observed in r_E, although not as pronounced for the variance.
In terms of the directional velocity error, VBAP outperforms MDAP, which is at the expense of an increased directional energy error. This shortcoming is specifically addressed by MDAP, resulting in minimal and direction-independent directional energy errors, which in turn increases the directional velocity error. Due to the high values in r_E, a perceived VSS width α of about 15° on average is predicted for the VBAP decoder, the lowest value of all evaluated decoders. The example MDAP configuration entails an increased median value of 16.5°, which is close to the predicted perceived VSS width of the evaluated ambisonic decoders. Compared to the ideal layout, the actual loudspeaker positioning errors have no effect worth mentioning on the performance metrics using VBAP and MDAP decoders.
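For orientation, the velocity and energy vectors underlying these metrics can be computed directly from the loudspeaker gains and directions. The Python sketch below uses the common gain-weighted definitions (gains for the velocity vector, squared gains for the energy vector) and assumes that the metrics reported here follow them; all function names and the two-loudspeaker example are hypothetical.

```python
import numpy as np

def unit_vectors(azi_deg, ele_deg):
    """Cartesian unit vectors for azimuth/elevation angles given in degrees."""
    azi = np.radians(azi_deg)
    ele = np.radians(ele_deg)
    return np.stack([np.cos(ele) * np.cos(azi),
                     np.cos(ele) * np.sin(azi),
                     np.sin(ele)], axis=-1)

def velocity_vector(gains, directions):
    """Gain-weighted mean of the loudspeaker directions."""
    g = np.asarray(gains, dtype=float)
    return (g @ directions) / np.sum(g)

def energy_vector(gains, directions):
    """Energy-weighted (squared-gain) mean of the loudspeaker directions."""
    g2 = np.asarray(gains, dtype=float) ** 2
    return (g2 @ directions) / np.sum(g2)

def angular_error_deg(vector, target):
    """Angle in degrees between a velocity/energy vector and the target direction."""
    cos_angle = np.dot(vector, target) / (np.linalg.norm(vector) * np.linalg.norm(target))
    return np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))

# Hypothetical example: two loudspeakers at +/-30 degrees azimuth, unequal gains.
directions = unit_vectors([30.0, -30.0], [0.0, 0.0])
gains = np.array([0.8, 0.6])
target = unit_vectors(0.0, 0.0)

r_v = velocity_vector(gains, directions)
r_e = energy_vector(gains, directions)
print("|r_V| =", np.linalg.norm(r_v), "error:", angular_error_deg(r_v, target), "deg")
print("|r_E| =", np.linalg.norm(r_e), "error:", angular_error_deg(r_e, target), "deg")
```

A vector magnitude close to one indicates that the energy is concentrated in few, closely spaced loudspeakers, which is why a subset-based decoder such as VBAP yields high magnitudes.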

Higher-order Ambisonics
The overall energy differs considerably across HOA decoder strategies but is very similar for the ideal and actual loudspeaker layouts. The smallest overall mean energy can be observed for the EPAD decoder, followed by MMAD and, at a greater distance, the SAD and AllRAD+ implementations. The energy variances are comparable, with slightly higher variations using the SAD decoder. For SAD and MMAD, however, it should be noted that directions below the lowest loudspeaker ring suffer from missing energy, while both EPAD and AllRAD+ largely maintain the energy for these directions at levels comparable to those for directions lying on the spherical cap.
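The energy loss of a sampling decoder for directions outside the loudspeaker coverage can be illustrated with a deliberately simplified Python sketch. It assumes a first-order decoder (the evaluation here uses N = 7), a plain sampling decoder without order weighting, and a hypothetical near-uniform spherical cap layout with 68 points and an arbitrarily chosen lowest elevation of −30°; none of these choices reproduce the actual array geometry or decoder implementations.

```python
import numpy as np

def sh_first_order(u):
    """Real spherical harmonics up to order 1 (N3D scaling) for unit vectors u."""
    u = np.atleast_2d(np.asarray(u, dtype=float))
    return np.column_stack([np.ones(len(u)), np.sqrt(3.0) * u])

def cap_layout(num_points, min_ele_deg):
    """Hypothetical near-uniform spiral of points on a spherical cap."""
    k = np.arange(num_points) + 0.5
    z_min = np.sin(np.radians(min_ele_deg))
    z = 1.0 - (1.0 - z_min) * k / num_points      # uniform in height = uniform in area
    azi = np.pi * (1.0 + np.sqrt(5.0)) * k        # golden-angle increments
    r = np.sqrt(1.0 - z ** 2)
    return np.column_stack([r * np.cos(azi), r * np.sin(azi), z])

def sad_energy(layout, source):
    """Summed squared gains of a sampling decoder for one source direction."""
    gains = sh_first_order(layout) @ sh_first_order(source).T / len(layout)
    return float(np.sum(gains ** 2))

layout = cap_layout(68, -30.0)                    # lowest elevation is an assumption
e_top = sad_energy(layout, [0.0, 0.0, 1.0])       # direction covered by the cap
e_bottom = sad_energy(layout, [0.0, 0.0, -1.0])   # direction below the cap
print("energy difference (bottom re top):", 10.0 * np.log10(e_bottom / e_top), "dB")
```

In this sketch the source below the cap receives less total energy than the source at the top, since no loudspeakers sample the lower part of the sphere. Decoders designed for irregular layouts compensate for this effect.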
The directional performance in terms of r_V and r_E is also comparable across all HOA decoders, with slightly decreased velocity vector magnitudes and increased energy vector magnitude variance using AllRAD+ and EPAD, respectively. The directional velocity errors show similar across-decoder behaviour, with AllRAD+ performing best with low variance. All decoders have similar mean directional energy errors, with increased and comparable variance ranges for EPAD and AllRAD+. The median values of perceived VSS width α lie around 18° for all decoders, with the largest variance for EPAD. The effect of the actual loudspeaker positions is noticeable in (slightly) increased median directional velocity and energy errors (SAD, MMAD, EPAD) and a change in variance, the latter surprisingly not always for the worse.

Acoustic crosstalk cancellation
The optimal and achievable channel separation results for the evaluated subset layouts, cf. Figure 5, and the full layout are shown in Figure 12. Table 1 presents the corresponding mean channel separation values, broadband and evaluated for low and high frequency ranges. For each doubling of the number of loudspeakers, the optimal and achievable channel separation increases on average by 6.8 dB and 2.8 dB, respectively. The optimal channel separation ranges from 40.9-74.7 dB, while decreasing considerably by 34.3 dB on average to 18.2-32.3 dB in the unmatched scenario. Apart from a decrease, a generally reduced variance of achievable channel separation values is observable. The reason for the substantial drop in channel separation can be traced back to the acoustic influence of the experimental setup construction. Saebø [79] and Kohnen et al. [80] demonstrated decreased filter performance in case of additional reflections. The latter authors presented a CTC filter approach that aimed at compensating for reflections up to the second order and thus increasing channel separation in reflective but geometrically simple acoustic environments. Although positive effects were found in the simulated scenario, the approach could not bring about substantial improvements when applied to measured spatial transfer paths. According to the authors, this is due to the increased acoustic complexity, which cannot be fully described by the geometric acoustic model used in the simulations. Transferred to the present loudspeaker array, more complex acoustic effects such as scattering or diffraction are also present and further reduce the effectiveness of the CTC filters.
Figure 9. Horizontal directivity pattern of one example array loudspeaker, normalised to its on-axis frequency response. The magnitude spectra were smoothed using filters with constant relative bandwidth of one-sixth octave.
Figure 11. Performance metrics for different panning-based decoder strategies. All ambisonic decoders were calculated for an order of N = 7. The results are based on the ideal and actual sampling layouts, represented by grey and black line colours, respectively. Box plots display medians and interquartile ranges with whiskers covering 1.5 times the interquartile range without showing outliers.
Figure 12. Optimal and achievable channel separation for the evaluated subset layouts, presented in Figure 5, and the full layout, using an L-CTC approach with a regularisation factor of 1e−3. Dots and error bars indicate broadband means and standard deviations, respectively, which were calculated for a frequency range of 90 Hz-20 kHz.
It is noteworthy that in setups with sixteen or more loudspeakers, both optimal and achievable channel separation increase faster in the high frequency range than in the low frequency range. The high frequency range thus accounts for the major share of the broadband performance, which can be explained by the concept of L-CTC. The resulting CTC system matrix represents an underdetermined system for L > 2 and is optimised for minimal energy. Increasing the number of loudspeakers favours a better energy distribution. This usually results in flatter CTC filter curves that require less gain-limiting regularisation, thus preserving the level of detail in the high-frequency structure of the playback HRTFs.
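The minimum-energy solution of the underdetermined system and the gap between optimal (matched) and achievable (mismatched) channel separation can be sketched for a single frequency bin as follows. This Python example assumes a Tikhonov-regularised least-norm inversion, which is one common formulation and not necessarily the exact L-CTC implementation evaluated here. The function names are hypothetical, and the random transfer matrices merely stand in for playback HRTFs and in-situ transfer paths.

```python
import numpy as np

def ctc_filters(H, beta=1e-3):
    """Regularised least-norm CTC filters for one frequency bin.

    H has shape (2, L): transfer functions from L loudspeakers to both ears.
    Returns C with shape (L, 2), mapping the binaural input to loudspeaker signals.
    """
    H = np.asarray(H, dtype=complex)
    gram = H @ H.conj().T + beta * np.eye(H.shape[0])
    return H.conj().T @ np.linalg.inv(gram)

def channel_separation_db(H_playback, C):
    """Level ratio of wanted to crosstalk path for each binaural input channel."""
    R = H_playback @ C                       # resulting 2 x 2 ear-signal matrix
    direct = np.abs(np.diag(R))
    cross = np.abs(np.array([R[1, 0], R[0, 1]]))
    return 20.0 * np.log10(direct / cross)

rng = np.random.default_rng(0)
num_ls = 8                                   # hypothetical subset size

def random_paths(scale):
    """Synthetic complex transfer paths (placeholder for measured data)."""
    return scale * (rng.standard_normal((2, num_ls))
                    + 1j * rng.standard_normal((2, num_ls)))

H_design = random_paths(1.0)                 # paths used for filter calculation
H_actual = H_design + random_paths(0.05)     # paths deviating during playback

C = ctc_filters(H_design, beta=1e-3)
cs_matched = channel_separation_db(H_design, C)
cs_mismatched = channel_separation_db(H_actual, C)
print("matched:   ", cs_matched)
print("mismatched:", cs_mismatched)
```

Even small deviations between the transfer paths assumed during filter calculation and those present during playback reduce the channel separation substantially in this sketch, in line with the qualitative difference between the optimal and achievable results.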

Electroacoustics
Due to their largely flat frequency responses under optimal conditions and the covered frequency range, the array loudspeakers are useful for modelling reference point sources in localisation experiments [34] and for the theoretically required local sampling of a surrounding sphere for ambisonic reproduction. However, the acoustic influence of other array loudspeakers and the mounting construction cannot be avoided and affects the optimal FIR-filtered loudspeaker responses, although the array was mounted in a highly optimised environment. As a practical side effect, the homogeneous directivity pattern around the on-axis direction makes the individual loudspeaker orientations less critical than their spatial positions. This also favours negligible changes in spectral magnitude responses for off-centre listening in case of a moving listener. Sufficient shielding from external noise in combination with the low inherent noise levels and high output signal-to-noise ratio renders the system a suitable environment for sensitive measurement applications and listening experiments. Since the noise floor lies below the NR15 curve, the experimental setup additionally fulfils the requirements for perceptual assessment of audio systems as per ITU-R BS.1116-3 [81] regarding maximum permissible BNL.

Vector base amplitude panning
Although originally designed for artistic applications without claiming physical correctness [16], the results of the VBAP and MDAP decoders motivate perception-driven reproduction. For real-time auralisation with hybrid room acoustic simulation approaches, vector-based panning techniques represent a flexible implementation approach for direct sound and early reflections [82]. However, one should not ignore the perceptual deficiencies when using VBAP [23] and instead consider MDAP with application-dependent spreading parameter settings. In such a dense loudspeaker arrangement, it would also be an option to use discrete loudspeaker reproduction for the simulation of late reflections [83], as their exact directions of incidence may be perceptually less important.

Higher-order Ambisonics
In general, the observed differences in the directional decoder energy must be taken into account when calibrating the playback level for studies that apply different decoder strategies. If only selected directions of sound incidence are used, the calibration should also be tailored to this subset. However, the use of sampling and mode-matching decoding will lead to noticeable loudness variations in the uncovered spatial range [55]. For auralisations that require an all-round incidence of sound waves, decoder strategies such as EPAD or AllRAD+ should thus be preferred to the others, as they are specially designed for irregular layouts. Using the latter strategy, accurate directional performance in terms of minimised loudness variation, correct direction of plane wave incidence and widely direction-independent spatial extent of VSS can be expected. The actual positioning errors of the loudspeakers seem to be within an acceptable range, since an effect on the decoder performance is clearly visible only for the SAD.

Acoustic crosstalk cancellation
The differences between optimal and achievable channel separation once again point out the necessity of an
Table 1. Mean optimal and achievable channel separation with standard deviation (μ ± σ) for the evaluated subset layouts, presented in Figure 5, and the full layout. The broadband values were calculated for a frequency range of 90 Hz-20 kHz.
To check the applicability of the system, the achievable channel separation results need to be compared with the minimum required values. Parodi and Rubak [73] used different stimulus types in a 2-loudspeaker CTC setup, virtually reproduced over headphones, with varying loudspeaker span angles. Based on the results of perceptual tests, they suggested average minimum values around 20 dB for speech, broadband noise and narrowband noise with centre frequencies below 1 kHz when the listener is located at the nominal centre position. They also reported decreased thresholds of 15 dB for narrowband noise with centre frequencies at 1 and 2 kHz. The most sensitive thresholds of 25 dB were found for higher centre frequencies, which was explained by the sensitivity to manipulations of interaural level differences.
According to the current mean broadband results, the proposed minimum channel separation [73] is largely achieved when using sub-layouts with eight or more loudspeakers, representing configurations that also allow complete rotational freedom of the listener in dynamic systems [7,45]. If necessary, a further improvement of channel separation towards higher frequencies can be achieved by covering the loudspeaker cabinets and the mounting construction with absorbent foam. Reducing the regularisation during CTC filter calculation also leads to an improvement in frequency regions which are characterised by notches in the playback HRTFs. It should be noted, however, that too low a regularisation factor may push the loudspeakers to their physical limits, leading to increased non-linearities. In addition, the narrowband peaks are likely to cause filter ringing and colouration. Although only weak correlations with localisation performance in the sagittal plane were observed, channel separation could be a useful predictor of localisation performance in the horizontal plane in matched and mismatched CTC systems [7,68].

Limitations
The evaluation of panning-based reproduction methods was based on the analysis of the decoder-dependent loudspeaker gains, including loudspeaker positioning errors. It therefore only allowed a rough estimate of the performance of the implemented system and did not account for the acoustic influence of the physical setup. A more comprehensive evaluation would require physically sampling the sweet sphere with a microphone array, facilitating a direct comparison of the reproduced sound field with the synthesis target. Alternatively, such an evaluation can be carried out using a sound field reconstruction method based on plane wave decomposition [84] or point source expansion [28]. Furthermore, a measurement-based derivation of energy and velocity vectors [85,86] potentially provides a more reliable prediction of the overall in-situ performance across decoders.

Conclusion
We presented the design and implementation of a surrounding 68-channel spherical cap loudspeaker array. Commonly used spatial audio reproduction methods were assessed based on various performance metrics to test their suitability and practicality. For amplitude panning approaches, recommendations regarding decoder selection were derived on the basis of simulation results. We also suggested suitable subset layouts for loudspeaker-based binaural reproduction, allowing sufficient channel separation, based on in situ measurement results. Collectively, the results indicate that the implemented system provides a good basis for objective and perceptual evaluations of VAEs created by means of the presented spatial audio reproduction methods.