Acta Acustica – Volume 7, 2023 – Topical Issue: CFA 2022
Article Number: 64
Number of pages: 22
DOI: https://doi.org/10.1051/aacus/2023056
Published online: 08 December 2023
Review Article
Acoustic research for telecoms: bridging the heritage to the future
1 Orange Labs, 2 Avenue Pierre Marzin, 22307 Lannion Cedex, France
2 President of the Centre de Découverte du Son, Kerouspic, 22140 Cavan, France
* Corresponding author: rozenn.nicol@orange.com
Received: 27 January 2023 – Accepted: 18 October 2023
In its early days, telecommunication was focused on voice communications, and acoustics was at the heart of the work related to speech coding and transmission, automatic speech recognition or speech synthesis, aiming at offering better quality (Quality of Experience or QoE) and enhanced services to users. As technology has evolved, the research themes have diversified, but acoustics remains essential. This paper gives an overview of the evolution of acoustic research for telecommunication. Communication was initially (and for a long time) only audio, with monophonic narrow-band sound (i.e. [300–3400 Hz]). After the bandwidth extension (from the wide-band [100–7000 Hz] to the full-band [20 Hz–20 kHz] range), a further breakthrough was the introduction of 3D sound, either to provide telepresence in audioconferencing or videoconferencing, or to enhance the QoE of contents such as radio, television, VOD, or video games. Loudspeaker or microphone arrays have been deployed to implement “Holophonic” or “Ambisonic” systems. The interaction between spatialized sounds and 3D images was also investigated. At the end of the 2000s, smartphones invaded our lives. Binaural sound was immediately acknowledged as the most suitable technology for reproducing 3D audio on smartphones. However, to achieve a satisfactory QoE, binaural filters need to be customized in relation to the listener’s morphology. This question is the main obstacle to a mass-market distribution of binaural sound, and solving it has prompted a large amount of work. In parallel with the development of technologies, their perceptual evaluation has been an equally important area of research. In addition to conventional methods, innovative approaches have been explored for the assessment of sound spatialization, such as physiological measurement, neuroscience tools or Virtual Reality (VR). The latest development is the use of acoustics as a universal sensor for the Internet of Things (IoT) and connected environments. Microphones can be deployed, preferably with parsimony, in order to monitor surrounding sounds, with the goal of detecting information or events thanks to models of automatic sound recognition based on neural networks. Applications range from security and personal assistance to acoustic measurement of biodiversity. As for the control of environments or objects, voice commands have become widespread in recent years thanks to the tremendous progress made in speech recognition, but an even more intuitive mode based on direct control by the mind is proposed by Brain Computer Interfaces (BCIs), which rely on sensory stimulation using different modalities, among which the auditory one offers some advantages.
Key words: Telecommunication / Spatial Audio (Wavefield Synthesis – WFS, Higher Order Ambisonics – HOA, Binaural) / Quality of Experience / Electroencephalogram (EEG) / Automatic sound recognition
© The Author(s), Published by EDP Sciences, 2023
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1 Introduction
Telecommunication, and more generally Information and Communication Technologies (ICT), was initially centered on two main lines: communication networks and voice data. Transmitting voice required developing methods and devices to record, compress and reproduce speech signals. Acoustics was at the heart of all these questions. The pioneering telecom operators (e.g. Bell Telecom Company, Deutsche Telekom, British Telecom, France Telecom, Nippon Telegraph and Telephone) were major players in acoustic research, covering areas as varied as speech and audio signal processing, electroacoustic transducers, room acoustics, sound perception, etc. ICT is a highly innovative field by nature, but the transformations tend to accelerate with time (see Fig. 1). The telecommunication landscape has changed, involving many operators, manufacturers, providers and research bodies. Some of them have disappeared or changed the way they are involved. At the same time, new actors have taken an important place in the ICT world. France Telecom (since renamed Orange) has been involved in all these evolutions of acoustics and the associated skills, which is why many of the examples given in this paper come from France Telecom. But, as advances exist only if they are shared by the community, examples from other research bodies and companies are also referenced.
Figure 1 History of ICT transformations along with the impact on acoustic research.
Since the days of fixed-line telephony, limited to the frequency range [300–3400 Hz], our way of communicating has evolved with the development of the internet and mobile telephony. Virtual and distant meetings (i.e. telemeetings according to [1]), combining audio and visual communication as well as document sharing, are now commonplace.
During the lockdowns imposed by COVID-19, telemeetings were, for many people all over the world, the only way of sharing information at a private or professional level. For many of them, it was a new experience. Furthermore, the scope of ICT activities has diversified and now includes radio, television, music or video streaming, and gaming. The latest evolution is the Internet of Things (IoT) [2], in which communication is no longer limited to dedicated equipment (i.e. phones or computers), but is potentially open to all objects in our environment. In this new paradigm, any object may become a connected device. Moreover, it can be equipped with sensors, for instance to measure environmental properties (such as temperature, atmospheric pressure, light, etc.), or to capture images and sounds of places, people or animals. The data thus collected can be processed and analysed by Artificial Intelligence (AI) tools (either embedded in local devices or in distant servers) to infer information or to trigger actions, leading to the concept of “smart” environments (i.e. smart building, home or car). Another new key component is the “digital twin”, which is a digital representation of a real-world physical entity (for instance a product, a system, a process, a building, etc.). The next step is to erase the border between the real world and its digital copy, thanks to eXtended Reality (XR), which combines Virtual, Augmented or Mixed Reality (VR, AR, MR). In telemeeting, XR can help to reduce the cognitive load (e.g. memorisation) and to improve the overall quality of communication. One challenge is to go beyond audiovisual immersion and to provide multisensory immersive experiences. To achieve this, the Internet of Senses (IoS) has been introduced by Ericsson [3]. This profound transformation of telecoms also means that its actors now extend far beyond the historical operators.
What is the place of acoustic research in these evolutions? Starting from a central role in the context of voice communication, its relative weight has decreased, but the auditory modality still offers fruitful solutions. To achieve immersive communication, enhancing the overall audio quality is an essential point [1]. It relies mainly on audio bandwidth extension and sound spatialization, which requires revisiting the entire audio processing chain (i.e. recording, reproduction, compression, audio stream formats). Methods of perceptual assessment also need to be modified, particularly to account for the spatial dimension, or to evaluate the crossmodal interaction with visual information. All this equally holds true for content delivery services (radio, television, music or movie streaming). In the IoT context, connected objects can be equipped with microphones to monitor sound environments, leading to the emergence of a new field of research: automatic sound recognition [4], which shares a number of issues with automatic speech recognition [5]. However, the problem is broader and more complex, since the recognition target covers all sounds except speech (although it includes forms of human vocalization such as shouts, e.g. cries for help, or laughter), in all their diversity, particularly in terms of frequency spectrum and time properties. Furthermore, for controlling these connected environments, the widespread deployment of voice assistants is the result of successful research into automatic speech recognition (which today benefits greatly from progress in machine learning), in combination with sound processing to overcome adverse recording conditions (i.e. low signal-to-noise ratio, reverberation) [5].
Brain Computer Interfaces (BCIs) offer an alternative solution, based on control by the mind, using brain headsets [6, 7]. In the case of reactive BCI, users are presented with stimuli which form a sort of menu of application commands. The challenge is to identify the stimulus on which the user’s attention is focused in order to infer the expected command. Although visual stimuli are often used, auditory stimuli have valuable properties for BCIs.
The present paper is an attempt to illustrate how acoustic research has reinvented itself along with the evolution of ICT activities. After a brief review of the early research focused on speech processing (Sect. 2), the work on immersive communication, and more specifically spatial sound reproduction, is presented in Section 3. Then, acoustic interaction in connected environments is described in Section 4, before concluding.
2 Early age: voice communication
In the early days of ICT, acoustic research was focused on speech signals, covering a wide range of topics: speech codecs, speech recognition and synthesis, instrumental and perceptual assessment of speech quality (see Chapter “Telecommunications Applications” in [8]). The challenge was also to provide users with equipment (terminals, networks, etc.) able to ensure clear and easy communication between people all over the world. Thus, the development and acoustic measurement of these devices was another key topic. In the beginning, analogue phones were tested only by human operators performing listening tests (e.g. checking the levels produced by the terminals, or the intelligibility through logatom recognition), before the concept of artificial mouth and ear was created. In order to ensure the interoperability of telecommunication networks, contributing to ICT standards in all the aforementioned aspects was also a major activity, driving the adoption of the tools and requirements necessary to ensure the best quality for users. The main institutions involved were the ITU-T Study Group 12 (International Telecommunication Union Telecommunication Standardization Sector) [9], ETSI (European Telecommunications Standards Institute) TC (Technical Committee) STQ (Speech and Multimedia Transmission Quality) [10] for European standards, and 3GPP (3rd Generation Partnership Project) [11]. While technologies evolve significantly, human beings remain almost the same. To ensure that technologies remain adapted to them, knowledge of human performance and behaviour has also improved considerably in its physiological, psychoacoustical, sociological and neuronal aspects. As an example, the development of artificial heads aims at simulating, as closely as possible, the shape and performance of real humans, including their reactions to new services, leading to new subjective and objective test methods able to define and check the best interactions between human beings and devices or services.
Advances in technologies and usages, from the band-limited [300–3400 Hz] analogue phones to terminals able to provide full-band [20 Hz–20 kHz] conversations in full duplex, have led to a parallel evolution of acoustic and speech processing in terminals, as well as of the evaluation protocols (e.g. measurement equipment and test signals) and the related criteria of assessment (instrumental and perceptual). A first step in this evolution was the introduction of wide-band speech [100–7000 Hz]. As a consequence, a low-impedance receiver was used to support the fine transmission of low frequencies. Anechoic rooms (Fig. 2) became essential for acoustic measurements, particularly with the introduction of hands-free functions. Initial tests for handset terminals used stationary signals, such as the long-term spectrum of speech or pink noise. However, hands-free terminals are implemented with speech detectors, echo cancellers and switching functions, for which stationary signals are no longer relevant. Consequently, new test signals based on speech sequences were created, until real speech signals were used, in combination with appropriate measurement methods involving sophisticated signal processing. With the high-impedance and telephony-band phones, the loudness rating (defining the perceived level of the signals) was calculated with reference acoustic leakages. In parallel, artificial ears were defined. The ITU-T Study Group 12 conducted round-robin tests aiming at defining or checking artificial ears, mouths, heads, signals, etc., with world-wide collaborations between companies and network operators. Digital phones (e.g. Integrated Services Digital Network, ISDN) were assessed through reference coding systems and needed new test parameters such as “out-of-band” measurements. Later, the development of mobile phones also called for new criteria and test interfaces, especially to meet the need for test equipment and signals to become as close as possible to human ones (artificial heads, speech-like signals, and eventually sequences of real speech). This progress has also opened the way to new terminals implementing super wide-band [50 Hz–14 kHz] and full-band audio, leading to modifications of devices in many ways. Standardization was also oriented towards new implementations, such as in cars, calling for new test methods, in particular to check the operation of terminals in the presence of realistic environmental noises. The connected car is a good example of the acoustic challenges encountered, especially for hands-free terminals, due to the cockpit acoustics, the sound environment (internal and external noises), the relative position of passengers and interfaces, and the essential use of powerful signal processing, in order to ensure the speech level required for good intelligibility in the car and for distant users. In the 1990s, the emerging concept of audio and video conferencing led to extensive research on several acoustic issues [12, 13]: sound recording (e.g. microphone arrays) [14], source localization and tracking, acoustic echo cancellation [15, 16], and sound reproduction [17]. More recently, the questions of loudness calculation (implementing a more realistic and applicable method than loudness ratings) and listening effort in noisy environments (as a first way to test intelligibility) have been standardized [18]. Even if these significant evolutions have been made for the benefit of most users, there is still a need for better design for users with specific impairments.
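To make the notion of a stationary test signal concrete, here is a minimal sketch that generates a pink-noise excitation by spectral shaping of white noise; it is a generic illustration rather than a standardized procedure, and the sampling rate, duration and normalization are arbitrary assumptions.

```python
import numpy as np

def pink_noise(n_samples: int, fs: int = 48000, seed: int = 0) -> np.ndarray:
    """Generate pink (1/f power) noise by spectral shaping of white noise."""
    rng = np.random.default_rng(seed)
    white = rng.standard_normal(n_samples)
    spectrum = np.fft.rfft(white)
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    # 1/sqrt(f) amplitude shaping yields a 1/f power spectral density;
    # the DC bin is left untouched to avoid division by zero.
    shaping = np.ones_like(freqs)
    shaping[1:] = 1.0 / np.sqrt(freqs[1:])
    pink = np.fft.irfft(spectrum * shaping, n=n_samples)
    return pink / np.max(np.abs(pink))  # normalize to full scale

signal = pink_noise(5 * 48000)  # 5 s of pink noise at 48 kHz
```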
Figure 2 Anechoic chamber of Orange Lab in Lannion (© Orange DGCI).
3 Immersive communication
3.1 Spatial audio and sound immersion
Physical distancing recommended during the COVID-19 pandemic has led to a widespread use of videoconferencing, and at the same time has highlighted the limitations and flaws of available solutions. Participation in telemeetings over several hours per day has generated an increased cost in terms of fatigue and cognitive load, particularly because auditory scene analysis is impaired by the low audio quality (e.g. lack of spatial cues), and also because non-verbal communication is missing. These observations have triggered a renewed interest in ICT research and industry in improving the QoE (Quality of Experience) of videoconferencing systems [19]. However, it was more than thirty years ago that the concept of telepresence [20–22] paved the way to immersive communication, aiming at giving distant participants the illusion that they are in the same room. Spatial audio reproduction and audiovisual presentation were identified very early as key assets to achieve perceptual immersion for telepresence [20, 21]. Indeed, studies investigating the ergonomic benefits of spatial presentation of speakers in conversational situations have shown a significant improvement in terms of speaker recognition and separation, memory, intelligibility, cognitive load, and overall QoE [1, 19, 23]. The effect is all the stronger as the number of speakers increases. It is the differences in spatial cues between the speakers that give attendees better perceptual performance. Furthermore, interaction between distant participants is judged more natural [19]. As a result, spatial audio (or 3D sound) became an active topic of research in the late 1990s. Technologies such as Wave Field Synthesis (WFS) [24–26], Ambisonics [27] and its generalization Higher Order Ambisonics (HOA) [28, 29], Vector Base Amplitude Panning (VBAP) [30] or binaural (initiated by Prof. Blauert and his team at Bochum University) [31–33] were investigated, as well as the perception of spatial sound. The first applications concerned audio and video conferencing; then spatial audio was used for the enhancement of audio and audiovisual content (i.e. music or video streaming, radio, television), which came into the ICT perimeter. The advent of 3D audio in telecoms required upgrading the entire sound processing chain, considering the questions of spatial audio recording, reproduction, compression, as well as the underlying concept of 3D audio format, which are described in the next subsections.
3.2 Spatial audio formats
3D audio formats relate to the representation of a sound scene by a set of signals, for storage or transmission purposes [34]. The main problem is the proper encoding and decoding of spatial information. For instance, in stereophony, the sound scene is encoded as 2 signals (i.e. left and right channels). Amplitude and time differences between these signals allow the virtual sound source (or phantom source) to be moved between the left and right loudspeakers. The binaural format is also based on 2 signals, but the spatial encoding relies on the localization cues used by the auditory system (i.e. Interaural Time Difference or ITD, Interaural Level Difference or ILD, and spectral cues [35]). Binaural spatialization aims at reproducing the acoustic wave at the entrance of the left and right ears and is the closest method to natural listening.
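As an illustration of how amplitude differences steer a phantom source, the following sketch implements a constant-power (sine/cosine) pan law; it is a generic textbook example rather than a method described in this paper, and the 1 kHz tone and sampling rate are arbitrary choices.

```python
import numpy as np

def constant_power_pan(mono: np.ndarray, pan: float) -> np.ndarray:
    """Pan a mono signal between the left (-1) and right (+1) loudspeakers.

    The sine/cosine law keeps the total reproduced power roughly constant
    while the inter-channel level difference moves the phantom source
    along the loudspeaker base.
    """
    theta = (pan + 1.0) * np.pi / 4.0          # map [-1, 1] to [0, pi/2]
    g_left, g_right = np.cos(theta), np.sin(theta)
    return np.stack([g_left * mono, g_right * mono], axis=0)

# Example: place a 1 kHz tone halfway between the centre and the right loudspeaker.
fs = 48000
t = np.arange(fs) / fs
stereo = constant_power_pan(np.sin(2 * np.pi * 1000 * t), pan=0.5)
```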
In the original concept of WFS, spatial information is encoded by the amplitude and phase values of acoustic waves along a surface, in reference to Huygens’ principle and following a process analogous to holography [24]. The sound scene is thus represented by a high number of signals, depending on the size of the recording surface and the accuracy of spatial sampling. Similarly, the HOA format is derived from a physical representation of the soundfield, corresponding to its expansion over Spherical Harmonic (SH) functions (i.e. the eigenfunctions of the acoustic wave equation in spherical coordinates). This provides a spatial analysis which is analogous to a Fourier series decomposition, but for spatial variations. The weights of the SH expansion are defined as the HOA components of the soundfield representation, i.e. its spatial spectrum in the HOA domain. It should be noted that this representation is hierarchical: SH components can be grouped in ascending orders, with each higher order providing a more refined level of spatial information. HOA encoding up to the Mth order leads to (M + 1)² signals. A variant of the conventional HOA format has been introduced as the Near Field Compensated-HOA (NFC-HOA) format [36], overcoming the initial restriction of Ambisonic theory to plane waves and generalizing it to spherical sound sources.
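To make the encoding step concrete, here is a minimal sketch of first-order Ambisonic (B-format) encoding of a mono plane-wave source; the traditional W/X/Y/Z convention with W attenuated by 1/√2 is an assumption about the convention, not a description of a specific system from this paper.

```python
import numpy as np

def encode_first_order(mono: np.ndarray, azimuth: float, elevation: float) -> np.ndarray:
    """Encode a mono plane-wave source into first-order Ambisonics (B-format).

    Angles are in radians. An order-M encoding would produce (M + 1)**2
    channels instead of the 4 components returned here.
    """
    w = mono / np.sqrt(2.0)                          # omnidirectional component
    x = mono * np.cos(azimuth) * np.cos(elevation)   # front-back
    y = mono * np.sin(azimuth) * np.cos(elevation)   # left-right
    z = mono * np.sin(elevation)                     # up-down
    return np.stack([w, x, y, z], axis=0)
```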
As a result, 3 main categories of 3D audio formats are usually distinguished: channel-based (e.g. stereophony, multichannel surround such as the 5.1 or 22.2 standards), scene-based (e.g. HOA), and object-based formats (e.g. Dolby Atmos) [37]. The main properties expected from a 3D audio format are: compactness (i.e. the best compromise between the number of signals and the extent of spatial information encoding), interoperability with other formats, and interactivity (i.e. the ability to edit a sound scene once encoded, for instance to modify the position of sound sources, or to discard one component, typically noise). In stereophony, spatial information is restricted to the space between the left and right loudspeakers, whereas in the binaural, WFS or HOA formats, the whole 3D space around the listener is spatialized. Furthermore, each direction is addressed equally, which means that no spatial area is favoured or discarded. Binaural encoding is the most compact. However, setting aside the special case of dynamic binaural synthesis (i.e. real-time sound rendering in combination with the simulation of binaural recording), modifying the scene after it has been recorded is not possible.
In contrast, the HOA format soon proved to be advantageous in many respects [38]. This is probably one of the reasons why WFS was overshadowed in the 2010s. First, it is universal in the sense that any acoustic wave can be encoded in HOA format. Second, it is generic, meaning that it is a core format which is independent of both the recording and reproduction systems. Third, the HOA format is scalable, offering a hierarchy of representation levels (corresponding to the SH series). Low-order representations allow a full description of the sound scene, although the spatial accuracy is low. Higher orders only refine spatial information. Thus, if the bitrate needs to be reduced for transmission purposes, higher orders may be removed without altering the overall description of the soundfield; only the spatial accuracy is decreased. Moreover, once encoded in HOA format, spatial information may be modified: for instance sound sources may be rotated, or the sound scene distorted. Spatial filters in the HOA domain can also be implemented to further edit the sound scene (e.g. directional filtering [39], or room equalization [40]).
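The scalability property can be illustrated with a small sketch: assuming the HOA channels are stored in ACN order (an assumption about the channel ordering, under which orders 0..M occupy the first (M + 1)² channels), truncation simply drops the trailing channels.

```python
import numpy as np

def truncate_hoa(hoa_channels: np.ndarray, target_order: int) -> np.ndarray:
    """Keep only the HOA components up to `target_order`.

    Assumes ACN channel ordering, where orders 0..M occupy the first
    (M + 1)**2 rows. Dropping the remaining rows reduces the bitrate while
    preserving a coarse (lower-resolution) description of the sound scene.
    """
    return hoa_channels[: (target_order + 1) ** 2]

# Example: reduce a 4th order stream (25 channels) to 1st order (4 channels).
hoa_4th = np.zeros((25, 48000))
hoa_1st = truncate_hoa(hoa_4th, target_order=1)
```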
3.3 Spatial sound recording
For a long time, stereophonic pairs (AB, XY, MS, etc.), and later multichannel trees, have been used for recording and radio broadcasting of sound scenes [41]. The result is a first step towards immersive reproduction, but restricted to a horizontal plane in front of the listener. Stereophony and multichannel sound were designed primarily for music recording, involving aesthetic choices which may not fit the needs of immersive communication. For instance, sound recording is focused on the frontal space, whereas immersive communication requires full 3D recording without favouring any particular direction. There are only two systems fulfilling this requirement: binaural recording by an acoustic manikin (e.g. a dummy head) [31] and soundfield recording by Ambisonic or HOA microphones. For instance, the “Soundferences of Trégor” (in French: “Sonférences du Trégor”) are live conferences aiming at sharing knowledge about acoustic engineering or research, with the original feature of being recorded and broadcast in binaural (Neumann KU100 dummy head) for a better QoE [42].
HOA is an attractive alternative because of its underlying format. The first Ambisonic microphone was invented by Craven and Gerzon [43]. It is called the “Soundfield” microphone and is composed of 4 cardioid microphones arranged in a tetrahedron, allowing the HOA components up to first order to be extracted. In order to record components of higher orders, HOA microphones based on Spherical Microphone Arrays (SMAs) [44–46] were developed [47, 48]. Designing these HOA microphones is a delicate compromise to fulfil several requirements. Firstly, the geometry of the SMA (i.e. sphere radius, number of microphones, sensor arrangement) needs to be properly defined, not only with respect to spatial sampling, expected frequency bandwidth, and maximal SH order, but also to ensure an accurate computation of HOA components. In particular, spatial sampling must preserve the orthonormality property of the SH functions. Secondly, HOA recording includes a post-processing step, in which the microphone signals are encoded by a set of filters into HOA components. These encoding filters may suffer from instabilities, which can be compensated for by adjusting the microphone directivity, or even by introducing diffracting structures [49]. One drawback of HOA microphones is the high number of microphones and associated signals, which is not compatible with handheld devices, such as smartphones. Stereophonic or binaural microphones dedicated to smartphones or tablets exist. For Ambisonic recording, as suggested in [50], the strategy based on SMAs may be replaced by a setup composed of only 2 or 3 sensors, allowing spatial information to be encoded as inter-channel time and amplitude differences by adjusting the spacing and directivity of the microphones. Moreover, it is possible to generalize this concept and to exploit the large number of microphones that smartphones are equipped with (e.g. the Metadata-Assisted Spatial Audio format proposed by Nokia [51]).
In 2009, a prototype of HOA SMA (composed of 32 microphones and allowing 4th order HOA encoding) was used for a live retransmission of a concert performed at the opera house of Rennes (see Fig. 3). Mozart’s Don Giovanni was played in Rennes, and simultaneously broadcast on a wide outdoor screen in front of the town hall and in distant concert halls (in Paris and Brest), as well as on radio and television [52]. Sound spatialization was recognized as a key driver of the immersive experience for the remote audience. One of the ambitions of this operation was to democratise access to opera houses and to share the musical event with people who do not have access to it for various reasons (ticket cost, people in hospital or in prison, etc.). Following the success of the first edition, the event was repeated several times. The Eigenmike® (see Fig. 3) was the first HOA microphone commercially available, from mh acoustics [53]. In its initial version, it consists of a 32-microphone SMA (4th order HOA encoding). A new version with 64 microphones, providing HOA encoding up to the 6th order, is now proposed. More recently, Zylia launched the ZM-1 microphone (19 digital MEMS microphones, HOA encoding up to 3rd order, see Fig. 3) [54]. Beyond 3D audio recording, the Eigenmike® has demonstrated its usefulness for the measurement of spatial room impulse responses [55, 56]. It can also be used for source localization, such as in [57], which shows that not only the direction of arrival, but also the distance of sound sources can be inferred. Instead of trying to get rid of reverberation, the method takes advantage of it, by exploiting the multipath propagation of acoustic waves. Furthermore, in the case of a moving source (or potentially a moving HOA microphone), the spatial diversity provided by the movement benefits the estimation of the room geometry.
Figure 3 Examples of HOA SMAs. From left to right: 3 prototypes developed by Orange (© J. Daniel), the Eigenmike® by mh acoustics [53] and the Zylia ZM-1 microphone [54].
3.4 Spatial sound synthesis
The alternative to natural soundfield recording is spatial sound synthesis. For instance, sound sources are recorded by monophonic microphones (i.e. proximity recording) and then a virtual recording (e.g. stereophonic, binaural, WFS or HOA) is simulated. This strategy offers great flexibility in designing the sound scene. Sources can be spatialized either at their exact location, or at arbitrary positions. Thus, in a virtual meeting, the spatial separation between the speakers can be increased in order to enhance speech intelligibility and scene readability. Besides, the sound scene may be personalized by users and adapted to their tastes and needs (e.g. auditory impairments). Instead of single microphones, microphone arrays can be used, with the advantage of high and variable directivity, allowing the signal-to-noise ratio and Acoustic Echo Cancellation (AEC) to be improved [13]. The pointing direction can also be changed by signal processing.
3.4.1 Binaural synthesis
Binaural synthesis is one particular example of spatial sound synthesis, which has aroused a great deal of interest, since it provides a simple and potentially very effective way to create 3D audio [35]. Furthermore, with the growing success of smartphones, which are now used for much more than just voice communication, including music, podcast or movie streaming, enriching their audio rendering with spatial sound has become an essential challenge. Binaural spatialization was immediately identified as the most suitable method for them, because it allows 3D sound scenes to be rendered using only headphones. In this respect, it has prompted a considerable amount of research. In binaural synthesis, the monophonic sound source is convolved with a pair of filters corresponding to the acoustic transfer function between the source and the entrance of the listener’s left and right ear canals.
This transfer function is strongly determined by the interaction between the acoustic waves and the listener’s morphology and is defined as the Head Related Transfer Function (HRTF) [35]. HRTFs can be measured by dedicated facilities in an anechoic chamber (see Chapter “Trends in Acquisition of Individual Head Related Transfer Functions” in [58]). They may also be measured in a room, in which case they include reverberation and are called BRIRs, for Binaural Room Impulse Responses. The objective is then to collect the binaural acoustic imprint of a place, or to capture the acoustic and spatial properties of a sound reproduction system. For instance, sound engineers can take their postproduction studio in their pocket with only a set of BRIRs. If listeners are equipped with a head-tracker (e.g. gyroscopic sensors or an optical camera) which monitors their head movements, the binaural synthesis is said to be dynamic: sound sources remain static in space regardless of head rotation [59]. The direction of the virtual sound sources is updated in real time as a function of the listener’s movements. On the contrary, without head-tracking, the whole sound scene rotates whenever the listener moves the head. Since HRTFs are measured for a finite number of directions, dynamic synthesis requires them to be spatially interpolated for the missing directions. Head movements coupled with head tracking allow the listener to exploit dynamic localization cues [35], and have been shown to improve localization performance and externalization [60]. However, the quality of dynamic rendering is affected by the accuracy and latency of the head-tracker.
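As a minimal sketch of static binaural synthesis, the snippet below convolves a mono source with a left/right HRIR pair (the time-domain counterpart of the HRTF); the HRIR arrays are assumed to come from a measurement or a database, and dynamic rendering would additionally require HRIR selection and interpolation driven by the head-tracker.

```python
import numpy as np
from scipy.signal import fftconvolve

def binauralize(mono: np.ndarray, hrir_left: np.ndarray, hrir_right: np.ndarray) -> np.ndarray:
    """Static binaural synthesis: convolve a mono source with the HRIR pair
    corresponding to the desired direction and return a 2-channel signal."""
    left = fftconvolve(mono, hrir_left)
    right = fftconvolve(mono, hrir_right)
    return np.stack([left, right], axis=0)

# Usage (with hypothetical HRIRs): binaural = binauralize(speech, hrir_l, hrir_r)
```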
3.4.2 HRTF measurement and databases
One straightforward way of obtaining individual binaural filters is by acoustic measurement [61–64], but this method is very intrusive for the subject, who must remain perfectly still for the entire measurement, which can be painful if the session lasts an hour or two [62–64]. The measurement time is related to the number of measurement directions, which can range from a few hundred to several thousand if the full 3D sphere is covered, and depends on the desired spatial resolution. Today many HRTF databases are available. A large number of them are publicly available and include morphological data: CIPIC (1250 directions) [62], LISTEN (187 directions) [64], ARI (1550 directions) [65], FABIAN (FABIAN manikin, 11,345 directions) [66]. Morphological 3D-scans may be associated with the acoustic data: FIU (72 directions) [67], RIEC (865 directions) [68], SYMARE (393 directions) [69], ITA (2304 directions) [70], HUTUBS (440 directions) [71], SONICOM (828 directions) [72]. A question of interest is the minimal number of directions needed to sample the spatial variations of HRTFs on the 3D sphere with sufficient accuracy, particularly if spatial interpolation is subsequently applied (e.g. for dynamic rendering). In a study based on a very discriminative experimental paradigm (i.e. a Three-Alternative Forced Choice test comparing measured and interpolated HRTFs), this number was estimated at 1100 [73]. However, localization tests suggest that interpolated HRTFs give rise to significant spatialization errors only when fewer than 150 directions are measured [74, 75]. HRTF interpolation based on SH requires a sampling grid with specific properties. For instance, in [77], a Gaussian grid composed of 1680 directions (6° of resolution in azimuth and 5.9° in elevation) is designed to evaluate the HRTF spectrum in the SH domain up to order M = 29, which is dictated by the head size (radius fixed to 10 cm) and an upper frequency limit of 16 kHz. This grid was used for the 2 databases created during the BiLi (Binaural Listening [76]) project: BiLi-IRCAM [77] and Orange [78], the latter incorporating 3D morphological data for 45 individuals [79].
For a long time, the diversity of HRTF representation formats in databases made them difficult to use, until an interchange format for storing HRTFs was proposed: the AES69 standard. Driven by the BiLi project, it was published by the Audio Engineering Society (AES) for the first time in 2015, with the purpose of facilitating the exchange of HRTFs and offering full compatibility between binaural renderers, in order to foster the dissemination of personalized binaural audio solutions [80]. This version was followed by two updates, in 2020 and 2022. It is based on the SOFA format (Spatially Oriented Format for Acoustics), which aims at representing any spatial acoustic data (e.g. directional room impulse responses measured with a microphone array excited by a loudspeaker array) and was initiated in the context of the AABBA (Aural Assessment By means of Binaural Algorithms) consortium [81]. Now most HRTF databases provide data in the SOFA (i.e. AES69) format [82].
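Since SOFA files are netCDF-4 containers, a set of HRIRs can be inspected with a generic netCDF library, as in the sketch below; the file name is hypothetical and the variable names (Data.IR, Data.SamplingRate, SourcePosition) are assumed to follow the usual SimpleFreeFieldHRIR convention, so they should be checked against the actual file.

```python
import numpy as np
from netCDF4 import Dataset

# Hypothetical file name; SOFA (AES69) files are netCDF-4 containers.
with Dataset("subject_hrtf.sofa", "r") as sofa:
    hrirs = np.asarray(sofa.variables["Data.IR"][:])                # (measurements, ears, samples)
    fs = float(np.asarray(sofa.variables["Data.SamplingRate"][:]).squeeze())
    directions = np.asarray(sofa.variables["SourcePosition"][:])    # typically azimuth, elevation, distance

print(hrirs.shape, fs, directions[:3])
```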
Another challenge of HRTF measurement is the reduction of the measurement time to a few minutes, or a few tens of minutes at most ([72, 78, 83–85], see also Chapter “Trends in Acquisition of Individual Head-Related Transfer Functions” in [58]). A first solution relies on continuous measurement while the subject is rotated at the center of a loudspeaker array [84]. Time-varying HRTFs are extracted by Least Mean Square-type adaptive filtering. The measurement time corresponds to the duration of a continuous rotation, which can be only a few minutes depending on the rotation speed. However, the slower the rotation, the better the accuracy. Another strategy is based on the Multiple Exponential Sweep Method (MESM), using a clever combination of sweep overlapping and interleaving with optimal delays [86]. This method is implemented in the HRTF measurement systems described in [72, 78, 85, 87]. For instance, the setup presented in [78] is composed of 2 vertical arcs of a sphere, along which 31 loudspeakers are arranged (Fig. 4). Measurement is performed azimuth by azimuth. For a given azimuth, the HRTFs of all elevations are measured at once, resulting in a measurement time of around 20 min per individual for 1680 directions.
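For illustration, the elementary excitation that MESM overlaps and interleaves is an exponential sine sweep; a minimal generator is sketched below, with arbitrary frequency range, duration and sampling rate (the systems cited above use their own optimized parameters and delays).

```python
import numpy as np

def exponential_sweep(f_start: float, f_stop: float, duration: float, fs: int = 48000) -> np.ndarray:
    """Exponential (logarithmic) sine sweep from f_start to f_stop in Hz."""
    t = np.arange(int(duration * fs)) / fs
    rate = np.log(f_stop / f_start)
    phase = 2 * np.pi * f_start * duration / rate * (np.exp(t / duration * rate) - 1.0)
    return np.sin(phase)

sweep = exponential_sweep(50.0, 20000.0, duration=2.0)   # 2 s sweep, 50 Hz to 20 kHz
```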
Figure 4 System of HRTF measurement installed in the anechoic chamber of Orange Lab in Lannion.
A last critical aspect of HRTF measurement is the requirement for specific facilities, i.e. an anechoic soundproof room and equipment (e.g. loudspeaker array, infrared cameras, tracking devices, etc.) allowing a plurality of directions around the listener to be measured with the finest spatial accuracy.
Therefore the concept of “low-cost measurement” of HRTFs has been explored. HRTFs may be measured in degraded conditions (i.e. in a non-anechoic environment, by using a non-ideal microphone and loudspeaker, or even with stimuli which do not fully meet the requirements of acoustic measurement, etc.). For instance, solutions based on sparse spatial sampling in combination with data modelling (e.g. neural networks) and knowledge taken from HRTF databases have been investigated [88–90].
More recently, an original method proposes to exploit individual binaural recordings, which are considered as “binaural selfies” in the sense that the recorded signals carry within them the acoustic imprint of the listener’s individual morphology [91, 92]. This represents an extreme case of distorted conditions of HRTF measurement. Using conventional processing, it would be virtually impossible to extract HRTFs from the recorded signals, yet the study shows how machine learning can be used to infer personalized HRTFs. Obviously, in the current state of their development, these solutions do not claim to rival the reference method of HRTF measurement in terms of accuracy, but they do offer a first level of HRTF personalization based on acoustic recording and compatible with large-scale distribution.
3.4.3 Personalization of binaural filters and synthetic HRTFs
In the context of mass-market implementation of binaural synthesis in ICT products and services, measuring the HRTFs of each user in an anechoic chamber is not feasible. Generic HRTFs, taken from a database and corresponding either to a dummy head or a real person, must be used instead, leading to non-individual binaural filters. However, the spatial encoding resulting from the interaction between the acoustic waves and the listener’s individual morphology is highly specific to this individual. HRTF properties (e.g. the frequency location and amplitude of the peaks and notches of their spectral response) strongly vary from one individual to another. Furthermore, psychoacoustic studies have shown that, when listening through non-individual HRTFs (i.e. binaural recording by a dummy head, or binaural synthesis using the HRTFs of another individual), several audible artefacts are observed: poor externalization, poor localization in elevation, and an increased rate of front-back and top-bottom reversals [93]. Nevertheless, it has also been shown that a listener is able to learn the HRTFs of another individual [94], but this reprogramming of sound localization may be quite long (several weeks in [94]). Moreover, defining the appropriate protocol of HRTF training is tricky [95]. On the contrary, if individual HRTFs (i.e. HRTFs measured for the listener) are used, it is possible to fool the auditory system. Virtual sound sources may not be discriminated from real ones, provided that binaural synthesis is properly implemented. More generally, localization performance is close to that measured in conditions of natural listening (i.e. good externalization, small localization errors and a low rate of front-back reversals) [96]. Besides, individual and non-individual binaural synthesis have been compared on the basis of the spatial response fields of neurons in the primary auditory cortex of ferrets, showing that spatial selectivity is altered in the case of non-individual HRTFs, whereas neural activity in the case of individual HRTFs is similar to that observed in the presence of real sources [97].
All these observations have prompted research on solutions to personalize binaural spatialization with alternatives to the acoustic measurement of HRTFs, one of the main motivations being to enhance the audio QoE of smartphones. Consequently, a wide variety of personalization tools have been proposed [98–110], making this question less and less critical today. Such solutions are generally composed of two components [103]: individual data, which provide information about the listener’s morphology and may be of different types (magnetic resonance imagery or laser scanning leading to a 3D mesh of the morphology, photographs, a sample of individual HRTFs, etc.), and a computational model for inferring the HRTF of an individual in a given direction.
One simple way of binaural personalization consists in selecting a set of HRTFs from a database, but the difficulty lies in identifying a simple yet effective strategy (e.g. morphological matching, listening test, etc.) for all listeners (binaural experts or not) to select the HRTF set that suits them best. Limiting the choice to a shortlist of a few HRTF sets is possible [111], by seeking a selection that either is satisfactory for the largest number of listeners (e.g. on the basis of perceptual assessment of spatial trajectories [112]), or brings together the most dissimilar sets [113]. Selecting HRTFs from a database can be a first step which is then supplemented by a tuning process, whereby the binaural transfer functions are adapted to the individual. For instance, the HRTF transformation can be driven by morphological comparison [114, 115].
Alternatively, synthetic HRTFs may be obtained by solving the problem of acoustic diffraction on the listener’s morphology by BEM (Boundary Element Method) or FEM (Finite Element Method) [100, 116]. This solution is becoming more and more affordable thanks to the acceleration of computing capacities and to technological developments simplifying the acquisition of morphological meshes (e.g. Kinect® technology or 3D reconstruction from 2D pictures [105]). However, a point that remains problematic is the capture of the pinna morphology, due to its proximity to the head, which hinders the measurement [79]. Several HRTF databases include such synthetic HRTFs [66, 69, 71].
Another solution for binaural customization is based on a linear expansion of HRTFs over a series of reconstruction functions [103, 109, 117]. The latter are obtained by analysing HRTF databases (e.g. by applying Principal Component Analysis -PCA- [103, 109, 118] or Independent Component Analysis -ICA- [115]). The decomposition may be performed either in the spatial domain (i.e. the reconstruction functions are directivities), or in the frequency domain (i.e. the reconstruction functions are frequency filters). The reconstruction functions are common to all individuals. An HRTF is then obtained as a weighted sum of these reconstruction functions, in which the weights are specific to the individual, meaning that the personalization effort only focuses on these weights. In [115], it is shown that ICA leads to reconstruction by a filter bank, with each unit controlling a portion of the spectrum. Furthermore, the weights obtained for several individuals are directivity functions presenting spatial patterns which are similar from one individual to another.
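As a rough sketch of the PCA variant of this idea (not the specific models of the cited studies), the snippet below learns a common basis from a matrix of log-magnitude HRTF spectra and reconstructs an HRTF as a weighted sum of the basis functions; the data here are random placeholders standing in for a real database.

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder data: one row per (individual, direction) pair of log-magnitude spectra.
rng = np.random.default_rng(0)
hrtf_log_mag = rng.standard_normal((5000, 256))

pca = PCA(n_components=10)
weights = pca.fit_transform(hrtf_log_mag)   # individual- and direction-specific weights
basis = pca.components_                     # reconstruction functions shared by everyone

# Personalization then amounts to estimating the weights for a new listener;
# an HRTF is recovered as a weighted sum of the common basis functions.
reconstructed = weights @ basis + pca.mean_
```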
3.5 Spatial sound reproduction
There are two main ways of rendering spatial audio: either with a loudspeaker array, or with headphones. Early experiments tended to favour loudspeaker systems, since they reproduce a more or less extensive soundfield, into which listeners can immerse themselves without constraint, and which can potentially accommodate multiple and moving listeners. However, to achieve a good quality of spatial reproduction and an expanded listening area, a high number of loudspeakers is generally required [25]. In stereophonic or multichannel surround systems, the area of accurate reproduction is limited to one point, i.e. the “sweet spot”. Large arrays of loudspeakers are not feasible for mass-market products. Besides, the acoustic waves emitted by the loudspeakers interact with every surface of the room, modifying and altering the reproduced soundfield. An absorbent environment is often preferred. In the same way, the more loudspeakers there are, the more complex the AEC problem becomes in communication applications [13]. For all these reasons, in early research on telemeetings, systems were set up in a dedicated room, i.e. a studio with a controlled room effect (e.g. low reverberation time and very low ambient noise) and equipped with microphones and loudspeakers for optimal performance of sound recording and reproduction. On the contrary, headphones (or earphones) bring spatial audio within everyone’s reach, and under any conditions, allowing absolute control of the reproduced soundfield at the entrance of the ears, yet with the risk of isolating the listener from external sounds. MP3 players and smartphones have greatly improved the acceptability of headphones, so that headphones may be considered as one of the most popular ways of listening to audio content today.
3.5.1 Rendering over loudspeakers
Early research on audio and video conferencing investigated loudspeaker-based reconstruction of the soundfield, such as WFS and HOA. Loudspeakers are used as secondary sources, each of which emits a “wavelet” (in Huygens’ terminology). The amplitude and phase (or more generally the time properties) of each wavelet is appropriately controlled so that the superposition of all the wavelets matches the target soundfield. For HOA reproduction, a decoding step is needed to convert HOA components into loudspeaker signals. It is performed by way of a decoding matrix which re-encodes spatial information into the loudspeaker referential, taking into account the loudspeaker setup (i.e. number and position of loudspeakers), under the constraint of an expected quality of soundfield reconstruction (i.e. decoding law). In contrast with the sweet spot of stereophony and multichannel surround, this soundfield reconstruction offers an extensive listening area. Nevertheless, the accuracy of sound reproduction suffers from spatial aliasing, which occurs as soon as the wavelength becomes shorter than about twice the loudspeaker spacing. WFS reconstruction at high frequencies is therefore inevitably degraded. For HOA, the reproduction accuracy depends on both the frequency and the maximum order of the SH expansion. More precisely, the area of accurate reconstruction is wide at low frequencies, but shrinks at high frequencies. However, by increasing the maximum HOA order, this area can be extended. Comparing spatial sound reproduction between HOA and WFS shows fairly similar behaviours, although differences exist, both in terms of the soundfield properties and the perceptual rendering (see Chapter “Creating auditory illusions with spatial-audio technologies” in [119]).
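A minimal sketch of this decoding step, restricted to first order and to the basic mode-matching approach (the decoding laws mentioned above can be more elaborate): the loudspeaker directions are re-encoded into B-format and the resulting matrix is pseudo-inverted.

```python
import numpy as np

def first_order_decoder(speaker_azimuths: np.ndarray, speaker_elevations: np.ndarray) -> np.ndarray:
    """Basic mode-matching decoder: re-encode each loudspeaker direction into
    first-order components (W, X, Y, Z), then pseudo-invert the matrix.
    Loudspeaker signals are obtained as D @ b_format."""
    cos_el = np.cos(speaker_elevations)
    C = np.stack([
        np.full_like(speaker_azimuths, 1.0 / np.sqrt(2.0)),  # W
        np.cos(speaker_azimuths) * cos_el,                   # X
        np.sin(speaker_azimuths) * cos_el,                   # Y
        np.sin(speaker_elevations),                          # Z
    ])                                                       # shape (4, n_speakers)
    return np.linalg.pinv(C)                                 # shape (n_speakers, 4)

# Example: square horizontal layout at +/-45 and +/-135 degrees.
D = first_order_decoder(np.radians([45.0, 135.0, -135.0, -45.0]), np.zeros(4))
```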
A WFS prototype based on a linear array composed of 16 loudspeakers spaced 15 cm apart was evaluated for videoconferencing [25, 120], showing an accurate reconstruction of the wavefront at low frequencies (i.e. below 1–1.5 kHz). At higher frequencies, an aliased wavefield occurs, resulting in multiple interfering wavelets that follow the direct sound, but which may be partially merged with the reverberant field of the reproduction room. In addition, a localization experiment revealed that WFS achieves an effective rendering of sound depth, contrary to intensity panning [121]. One application of this work was the “Telepresence Wall” (in French: Mur de Téléprésence [122]), conceived as a virtual window connecting 2 remote sites of a company and inviting people to informal discussions, by analogy with what happens when they meet by chance in a corridor (Fig. 5). An experimental system was in operation between Paris and Lannion during 2001–2005. Giving users a sense of co-presence was a critical issue in order to come as close as possible to natural conditions of communication. Video reproduction provided a life-size display of people with high video quality. A system of mirrors was used to correct the distortion of gaze in order to preserve eye contact. As for the audio setup, WFS rendering was implemented in combination with a microphone array and multichannel AEC, providing fluent voice interaction. From 2001 to 2003, research on WFS was continued in the European project CARROUSO [123, 124], which aimed at transmitting 3D sound scenes using 2 main technologies: the MPEG-4 (Motion Picture Experts Group of the International Organization for Standardization) format for the scene description and WFS for the spatial sound reproduction. Another achievement was the successful implementation of Multi-Actuator Panels (MAPs) for WFS rendering [125]. Today, spatial audio is included in telemeeting products such as BlueJeans [126] and MeetMe (proposed by British Telecom) [127]. They both use the Dolby Voice® technology [128]. The IX5000 Telepresence series by Cisco also features 3-channel AAC-LD (Advanced Audio Coding Low Delay) spatial audio [129].
A last question raised by spatial sound reproduction is the interoperability between 3D audio formats. In many ICT products, the reproduction system is not controlled. How can a good quality of spatial audio rendering be provided for an audio stream received in HOA format when the listening system is simple stereophony? Conversion tools are available: either upmixing (e.g. to convert a stereophonic 2.0 stream to a multichannel 5.1 setup) or downmixing (e.g. to convert a multichannel 5.1 stream to a 2.0 stereophonic system) [130, 131].
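As a simple illustration of downmixing, the sketch below folds a 5.1 stream into 2.0 using the classical -3 dB (0.7071) coefficients for the centre and surround channels; this is a generic passive downmix, not the specific methods of [130, 131], and real products may add normalization to avoid clipping.

```python
import numpy as np

def downmix_51_to_stereo(l, r, c, lfe, ls, rs, centre_gain=0.7071, surround_gain=0.7071):
    """Passive 5.1 to 2.0 downmix with -3 dB centre and surround coefficients.
    The LFE channel is commonly discarded in this kind of downmix."""
    left_out = l + centre_gain * c + surround_gain * ls
    right_out = r + centre_gain * c + surround_gain * rs
    return np.stack([left_out, right_out], axis=0)
```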
3.5.2 Rendering over headphones
In addition to all the advantages that have already been reported, headphones can be considered as a reproduction method compatible with all multichannel audio formats. Indeed, any layout of virtual loudspeakers can be created by using binaural synthesis, with or without head tracking. This possibility was exploited very early on for multichannel 5.1 audio content [132]. Radio and television broadcasters have seized this opportunity to promote multichannel sound to the mass market [76, 133]. Moreover, a virtual loudspeaker setup is often the only solution to reproduce HOA content. For instance, 4th order HOA requires a spherical array of at least 25 loudspeakers, a facility which is difficult to access.
Like any transducer, headphones may alter sound reproduction through their own response. However, the case of headphones is a little more complex than that of loudspeakers. First, their transfer functions (i.e. HPTF, HeadPhone Transfer Function) depend not only on their electro-acoustic properties, but also on the acoustic coupling between the speaker housing and the listener’s outer ear, meaning that the individual morphology and the particular way the headphones are placed on the head (which is different each time the headphones are put on again) have an impact. For a proper compensation, the HPTF must be individually measured, while asking the listener to reposition the headphones several times. An average response is then calculated from the series of repositionings. Furthermore, following standardization recommendations, manufacturers calibrate headphone reproduction on the basis of stereophonic content and in reference to the reproduction quality achieved by a standard 2-loudspeaker setup [134]. Additional criteria are needed to assess the ability of headphones to reproduce binaural sound. Indeed, what is expected in binaural reproduction is that the sound pressure induced by a real source (i.e. free-field propagation) at the listener’s eardrum is identical (within a constant coefficient) to the sound pressure induced at the same location when this source is reproduced binaurally over headphones, leading to the definition of the ratio of these 2 pressures as a new criterion (i.e. free-field propagation vs binaural reproduction) reflecting binaural specificity [31]. During the BiLi Project, a set of 80 headphones was measured in terms of HPTF, distortion, transmission loss, electro-acoustic impedance, and comfort (mechanical stress on the listener’s head) [135]. The ratio between free-field and binaural pressures was added to specifically evaluate the ability to render binaural sound. Binaural files were also played by each model and simultaneously recorded by a dummy head, for the purpose of future perceptual assessments.
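A rough sketch of the compensation step described above, under simplifying assumptions (magnitude-only, zero-phase equalization and a simple regularization term): the HPTF magnitude is averaged over the repositionings and then inverted.

```python
import numpy as np

def headphone_eq(hptf_measurements: np.ndarray, beta: float = 1e-3) -> np.ndarray:
    """Build a regularized, zero-phase headphone equalization curve.

    hptf_measurements: complex HPTF spectra, shape (n_repositionings, n_bins).
    beta: regularization constant limiting the boost applied in deep notches.
    """
    mean_mag = np.mean(np.abs(hptf_measurements), axis=0)   # average over repositionings
    return mean_mag / (mean_mag ** 2 + beta)                # regularized magnitude inversion

# The returned curve is applied as a zero-phase EQ to the binaural signal spectrum.
```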
3.6 3D audio coding
Transmission of 3D audio content raises the question of 3D audio compression. Early work considered the coding of stereophonic or multichannel streams [136, 137], but the HOA format, which relies on a potentially very high number of signals (e.g. 25 channels for 4th order HOA), makes this issue even more acute, and it was already discussed more than 4 decades ago [138]. More recently, an original strategy aims at taking advantage of the limited spatial resolution of the auditory system to reduce the encoded information (i.e. perceptual optimization of spatial audio coding) [139]. The underlying assumption is that, in a complex sound scene, i.e. a scene composed of multiple sources, the auditory system is not able to perceive all the information with equal accuracy. When the listener focuses on a given sound source, competing sources have a distracting effect, resulting in particular in a localization blur. The Minimum Audible Angle (MAA) corresponds to the smallest angle between two positions of the same sound event that the auditory system is able to discriminate. MAAs have been measured for various configurations, allowing the phenomenon of spatial blurring to be quantified as a function of the frequency content of both the target and distracting sources, their level, their spatial position and the number of distracting sources. The results were implemented in two multichannel audio coding schemes, one for the parametric representation of stereophonic and 5.1 audio, and the other for the HOA representation. They rely on a dynamic adjustment of the spatial accuracy that keeps the resulting spatial distortion within the localization blur, such that it remains unnoticeable.
In 2015, MPEG released the MPEG-H 3D audio standard (second edition in 2019) [140], which addresses bitrate-efficient and high-quality representation of immersive sound signals and is now being adopted for broadcast and streaming applications [141]. It is designed for maximum flexibility: it supports various 3D audio formats (i.e. channel-based, object-based and HOA formats), and is compatible with a wide range of reproduction conditions (e.g. mono, stereo, surround, 3D audio), either for loudspeaker setups or for headphones. MPEG-H 3D audio also provides dedicated toolsets for HOA coding and decoding. An extension of MPEG-H 3D audio to VR and AR applications (the future MPEG-I immersive audio standard) is under development [141]. In parallel, the 3GPP is currently in the process of finalizing the standardization of a new mobile communication codec, named IVAS for Immersive Voice and Audio Services, which introduces immersion (including binaural and HOA up to 3rd order) into traditional voice services, as well as more general immersive multimedia experiences [142].
3.7 Perceptual assessment of spatial audio
A key step in the development of 3D audio technologies is their perceptual evaluation, i.e. the assessment of how the reproduced sound scene is perceived by listeners. Evaluation of perceived audio quality has been an essential part of ICT research since the earliest days. Conventional methods aim at the measurement of Basic Audio Quality (BAQ) [143, 144], where “quality” refers to the fidelity with which a signal is transmitted or rendered by a system (see Chapter “Telecommunications applications” in [8]). Experimental paradigms (see Chapter “Sensory evaluation methods for sound” in [8]) such as the double-blind triple-stimulus with hidden reference [143], paired comparisons, or multiple comparisons with known and hidden references and anchors [144] have been used. However, these methods require a reference, yet the reference is sometimes not easy to define in the case of spatial sound reproduction [1]. The real audio scene could be chosen as a reference, but it is rarely possible to present it to listeners for direct comparison. Furthermore, the perception of a 3D audio scene is highly multi-dimensional (see Chapter “Binaural spatial reproduction” in [8]): not only attributes representing a physical or mathematical property of the sound source (timbre, location, etc.), the acoustic space or the sound reproduction system have to be considered, but also attributes related to the listening experience (e.g. the naturalness or the readability of the sound scene) and to the way in which the listener’s mental state is modified by the sound [145]. For instance, measuring and analysing the emotions aroused by sounds provides valuable insight (see Chapter “Emotions, associations and sound” in [8]), particularly in terms of immersion. Besides, it may be difficult for naive listeners to understand the exact meaning of a given attribute. For all these reasons, measuring the QoE seems more appropriate than measuring the BAQ [1]. The QoE is defined by the ITU-T as the overall acceptability of an application or service, as perceived subjectively by the end-user. QoE includes the complete end-to-end system effects (client, terminal, network, service infrastructure, etc.), and the overall acceptability may be influenced by user expectations and context. To illustrate this topic, this section discusses some of the most salient questions, with examples giving insight into the methods devised to answer them. First, the assessment of the fundamental concept of spatial quality is discussed. But the latter is only one dimension of QoE. Evaluating the extent to which the listener is immersed in the virtual scene (i.e. sound immersion) is also of interest and is presented next. Then, cross-modal interactions between audio and video information are explored. Finally, the ecological validity of perceptual assessment is examined.
3.7.1 Spatial quality
When assessing spatial audio systems, the question of primary interest concerns their ability to reproduce virtual sound sources spatialized around the listener. The most straightforward way to evaluate this aspect is a localization test, which measures localization performance under the assumption that the higher the localization accuracy, the better the spatial sound reproduction. However, one difficulty is to find a suitable method for the listener to report the perceived direction [146, 147]. Moreover, localization performance varies across individuals. Besides, merging all the metrics (i.e. azimuth and elevation errors, front-back and top-bottom reversals) into an overall score may be tricky. Instead of localization accuracy, the time response may be considered as an alternative measure of spatial quality, as suggested by [148]. The localization test is revisited by asking the participant to localize not as precisely as possible, but as quickly as possible, assuming that the faster the listener, the higher the spatial quality (Fig. 6). Results show that the time response decreases as the HRTF reconstruction accuracy increases and correlates with the objective distance between HRTF sets, suggesting that time response can be used to rank HRTF sets.
Figure 6 HRTF assessment based on the time taken by the listener to localize virtual sounds. From left to right: illustration of a participant during the experiment, average time response (τ) as a function of the HRTF set ([R19, R27, R45, R65, R82, R121]: synthetic HRTF sets reconstructed with increasing accuracy, [I]: individual HRTFs obtained by acoustic measurement, [NI1, NI2, NI3]: HRTF sets taken from other individuals) for 3 participants, objective distance based on ISSD (Inter-Subject Spectral Difference [98]) between the individual set and the set under assessment (from [148]).
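To make the objective distance of Figure 6 concrete, the following Python fragment sketches a simplified ISSD-style spectral distance between two HRTF sets. It is only an illustrative sketch: it assumes both sets are stored as NumPy arrays of head-related impulse responses sampled on the same spatial grid, and the FFT size and frequency band are arbitrary choices, not necessarily those of [98] or [148].

```python
import numpy as np

def issd(hrir_a, hrir_b, fs=44100, f_lo=3700.0, f_hi=12900.0, n_fft=512):
    """Simplified Inter-Subject Spectral Difference between two HRTF sets.

    hrir_a, hrir_b: arrays of shape (n_directions, n_taps) holding HRIRs
    measured on the same spatial grid. For each direction, the variance
    (across frequency, within a band where pinna cues dominate) of the
    log-magnitude difference is computed, then averaged over directions.
    """
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    band = (freqs >= f_lo) & (freqs <= f_hi)

    spec_a = 20 * np.log10(np.abs(np.fft.rfft(hrir_a, n_fft, axis=-1)) + 1e-12)
    spec_b = 20 * np.log10(np.abs(np.fft.rfft(hrir_b, n_fft, axis=-1)) + 1e-12)

    diff = spec_a[:, band] - spec_b[:, band]   # dB difference per direction
    return np.mean(np.var(diff, axis=-1))      # variance over frequency, mean over directions
```

A lower value indicates that the two sets are spectrally closer, which is the sense in which the objective distance of Figure 6 is used to rank HRTF sets.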
Binaural spatialization raises a specific question: the appropriateness of a given set of HRTFs (either measured or synthetic) for one individual. In research studies, it can be answered by a localization experiment, but this solution is not suitable for mass-market services (e.g. enhancing music and TV offerings with binaural immersion), where users, who are generally not experts in spatial audio, need to identify their preferred HRTF set quickly and reliably from a pre-selection of sets. By using VR tools and a 3D audiovisual environment, the localization test can be turned into a game, allowing the HRTF set best suited to an individual to be determined in a recreational way. For instance, in [111], players are immersed in a 3D world surrounded by loudspeakers. They are presented sound stimuli and are asked to point a laser at the loudspeaker in the perceived direction. On the basis of the localization error in elevation, a matching HRTF set (among a shortlist of 4) is successfully found for 20 participants out of 24. In this experiment, the level of HRTF personalization is fairly high, given that exact positioning of sound sources is expected. However, highly accurate localization of virtual sources is not always necessary, depending on the task involving sound spatialization. Thus, another experiment, also based on a VR shooter game, shows that the influence of the HRTF set on player performance is minimal [149].
3.7.2 Sound immersion
Spatial audio contributes to the feeling of immersion in the sound scene, which brings telemeetings closer to real-life conditions. Assessing the immersion conveyed by a spatial sound system is thus another point of interest, but measuring this attribute is not straightforward, if only because of the difficulty of agreeing on its definition [150]. Using questionnaires or direct rating is the usual strategy. Alternatively, physiological and behavioural measurements have been explored, with the advantage of not asking listeners to interpret their perception in terms of immersion; but they require assumptions about the relationship between immersion and the observed signals. For instance, in the study reported in [151], participants perform a visual detection task based on an oddball paradigm, while being exposed (through Sennheiser HD 650 headphones) to an ambient noise including isolated sounds acting as distractors. Participants are presented a sequence of chessboard pictures, most of them flawless (i.e. standard stimuli) and some of them randomly anomalous (i.e. deviant stimuli), and are asked to detect the deviant stimuli. The audio stimuli were recorded in the experimental room in 2 versions: one with a Neumann KU dummy head and the other with a stereophonic XY pair. It is assumed that sound immersion reinforces the distracting effect; the distracting effect of binaural sounds is therefore expected to be stronger than that of stereophonic sounds. Results show that detection accuracy is not influenced by the type of audio distractor, whereas the reaction time is significantly longer with binaural distractors (566 vs 550 ms). In addition, each participant wears an EGI® sensor net (256 channels) for EEG (Electroencephalogram) monitoring (Fig. 7), which reveals that deviant stimuli elicit a P300 effect whose amplitude is significantly higher with binaural distractors, particularly during its first phase known as the P3a (Fig. 7). This part of the P300 is described as reflecting surprise and, associated with longer reaction times, it has already been identified as a sign of attentional switching. These results suggest that the distracting effect of binaural sounds is stronger, which supports the idea that binaural reproduction conveys better immersion. Moreover, they confirm that an indicator of sound immersion can be found in EEG signals. Observing brain activity, i.e. at the very place where the percept is formed, is thus emerging as a new tool to study the perception of 3D sound scenes.
Figure 7 Neuroimaging for perceptual comparison of binaural and stereophonic sounds. From left to right: EGI® sensor net, average ERPs (Event Related Potentials) measured in the parieto-occipital region and the frontal region. Blue and purple curves correspond to deviant stimuli in the binaural and stereophonic conditions respectively, while red and green curves correspond to standard stimuli in the binaural and stereophonic conditions respectively (from [151]).
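As an illustration of how such EEG indicators can be derived, the following sketch shows a basic ERP analysis: epochs are cut around stimulus onsets, baseline-corrected and averaged, and the mean amplitude in an early P300 (P3a) window is extracted. This is a hypothetical, simplified pipeline (single channel, no artifact rejection, arbitrary window boundaries), not the actual processing chain of [151].

```python
import numpy as np

def erp_average(eeg, onsets, fs, t_pre=0.2, t_post=0.8):
    """Baseline-corrected average ERP for one EEG channel.

    eeg: 1-D array (single channel, microvolts); onsets: sample indices
    of stimulus onsets (e.g. deviant or standard chessboard pictures).
    """
    n_pre, n_post = int(t_pre * fs), int(t_post * fs)
    epochs = []
    for s in onsets:
        if s - n_pre < 0 or s + n_post > len(eeg):
            continue                                   # skip incomplete epochs
        ep = eeg[s - n_pre : s + n_post].astype(float)
        ep -= ep[:n_pre].mean()                        # baseline correction
        epochs.append(ep)
    return np.mean(epochs, axis=0)

def p3a_amplitude(erp, fs, t_pre=0.2, window=(0.25, 0.35)):
    """Mean amplitude in an assumed early P300 (P3a) window after onset."""
    i0 = int((t_pre + window[0]) * fs)
    i1 = int((t_pre + window[1]) * fs)
    return erp[i0:i1].mean()
```

Comparing `p3a_amplitude` for deviant versus standard ERPs, separately in the binaural and stereophonic conditions, reproduces the type of contrast illustrated in Figure 7.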
3.7.3 Cross-modal interactions
The combination of audio and video information in telemeetings calls for observing and measuring cross-modal interactions between auditory and visual stimuli, considering either 2D or 3D video (i.e. video augmented by depth rendering) for the latter [152]. In this context, where video and sound are somewhat arbitrarily associated, people are presented with visual and auditory stimuli that are not necessarily perceived as a single percept.
Their spatial integration has been studied in terms of both direction (i.e. azimuth and elevation) [153] and distance [154] to measure the boundary of the integration window. Furthermore, the perception of one modality is likely to influence that of the other. For instance, in the ventriloquism effect, localization is captured by the visual stimulus [155]. The effect has been largely investigated for azimuth discrepancies, but also exists for distance [156–158] and elevation [159] mismatches.
In the case of WFS, a question of particular interest concerns the distance perception of sound stimuli in combination with 3D video. In [154, 157, 160], the perceived distance is assessed in 3 conditions: unimodal presentation of either an auditory or a visual stimulus, and bimodal congruent presentation of auditory and visual stimuli, showing that distance is underestimated for far sources. Besides, in [154], the perception of audiovisual stimuli that are incongruent in distance (while being congruent in azimuth and elevation) is studied. Examining whether the auditory stimulus is able to modify the perceived distance of the visual one revealed no influence (i.e. no reverse ventriloquism effect). The range of distances for which the auditory and visual stimuli are perceived as spatially congruent despite their depth discrepancy is also measured as a function of the distance of the visual stimulus. The extent of the resulting integration window is greater than a few meters and increases with the distance of the visual stimulus (Tab. 1), allowing the sound depth to be emphasized relative to the visual depth without losing audiovisual congruence. Finally, assessing the overall QoE of 3D audiovisual sequences (3D video and WFS sound) suggests that adding sound depth to 3D video may enhance audiovisual immersion, but the effect strongly depends on the sequence.
3.7.4 Ecological validity of assessment
In any experiment, a difficult compromise has to be found between internal validity, which is achieved by controlling the experimental parameters to minimize biases, and external validity, which aims at ensuring that the results are representative of real life [161, 162]. This problem arises in particular for evaluations involving smartphones, which are inherently associated with a wide range of usage conditions, depending on the device, the individual and, above all, the environment (home, work, public transportation, etc.). This diversity is difficult to replicate in a laboratory, but is needed for the ecological validity of the evaluation. One solution is the Experience Sampling Method (ESM) [163], which was tested in a study assessing the contribution of binaural sound to the QoE of audiovisual applications for smartphones [164]. For 5 weeks, a panel of 30 participants was invited twice a day by SMS to play a session of a video game inspired by “Infinite Runner”, in whatever situation they were in when they received the request. The game associates visual and audio information, the latter being randomly rendered in mono or in binaural. For each session, the following data were collected: a description of the context, the final score (assuming that spatial quality may influence the reaction time, as suggested in [148], see Fig. 6) and the user's answers to a questionnaire about the perceived immersion and the memorization of information given during the game. Results show that binaural sound slightly reinforces the player's immersion, but no significant effect on the score or on memorization was observed. In addition, this study highlights the difficulties involved in implementing the ESM paradigm, which requires regular monitoring of participants so that they complete the entire test.
Instead of bringing experiments into real life, VR makes it possible to do the exact opposite, i.e. to bring real life into laboratory experiments, as pointed out by [161]. Indeed, with VR tools, a wide range of environments can be synthesized, while precisely controlling experimental parameters and combining multiple sensory inputs [165]. For instance, the audio modality can be complemented by visual, proprioceptive and vestibular cues. Participants' behaviour or cognitive load can also be varied. VR is a promising direction for future assessments of QoE.
4 Interaction with connected environments
In the 2010s, a promising new field of research emerged with the Internet of Things (IoT) and connected devices [2]. Environments (e.g. Smart Home, Smart Building, Smart Factory, Smart City, etc.) and objects (refrigerator, washing machine, oven, furniture, etc.) become not only communicative, but also sensitive. They rely on components equipped with sensors, actuators, communication and even processing capabilities, leading to the concept of ambient intelligence. Among all the available sensors, the acoustic modality is of particular interest, resulting in the emergence of the Internet of Sounds (IoS) [166]. A first advantage is that microphones are both inexpensive and easy to implement. Furthermore, we know from our daily experience that the auditory modality conveys a great deal of useful information. In the evolutionary process, hearing appears as a sense dedicated to alertness: our ears are never closed and listening is omnidirectional. When someone is trying to break into a house, the sound of breaking glass can alert the occupants. A machine operator is also able to detect a dysfunction, or even a future failure, through a change in the machine noise. These are just a few examples of what automatic sound recognition [4] can be used for. It is a research field much less well known than automatic speech recognition, but considerably more challenging in the sense that it addresses all sounds except speech. However, filling our environment with microphones may raise concerns about the risk of “wiretapping” and about privacy protection. Any device that listens to our environment must offer reliable guarantees on data protection in order to foster the acceptability of the associated services.
The concept of ambient intelligence is based on 3 components: the environment, the user, and the system which monitors the environment and controls it to suit the user's needs. In addition to the ability to perceive the environment, the system also requires tools to act on it. This section gives examples of these two aspects in relation to acoustics, firstly by considering the acoustic monitoring of environments (including the creation of dedicated devices for sound recording and the development of sound recognition models), and secondly their control using the audio modality (e.g. vocal control, BCI).
4.1 Acoustic monitoring of environments
For several decades, the concept of soundscape has been defined, often with different approaches and based on different skills (e.g. audionaturalists, bioacousticians, sound engineers, etc.). Many projects have been launched to collect sounds not only of nature, but also of towns or specific areas. Most of the time, these sounds are then classified and stored in repositories. They can also be shared in listening sessions to immerse people in these soundscapes, using the technologies which ensure the highest fidelity of recording, coding and reproduction. Acoustic monitoring is used in a growing number of contexts [166]: urban soundscapes to measure noise pollution or to detect dangerous situations (e.g. gun shots in cities), natural soundscapes for the ecoacoustic surveillance and protection of wildlife (e.g. census of animal species in an ecosystem, tracking of migratory bird flows, intrusion detection in a protected area) [167–172], volcano activity to predict eruptions, manufacturing machines for predictive maintenance and anomaly detection, etc. Analysing environmental sounds in the home is another area of application arousing increasing interest, for instance for security (e.g. detection of a home invasion, of a person falling or of any distress situation) or home automation (e.g. presence sensing, detection of household appliances).
4.1.1 Environmental sound recording
An essential point of acoustic monitoring is the capture of environmental sounds. Recording sound outdoors for urban or natural soundscapes is subject to potentially unusual and adverse conditions, requiring microphones and devices which can withstand wind, rain, and large variations of temperature and hygrometry. Solutions for energy harvesting also need to be found. Furthermore, as any sound is of interest, the properties (i.e. frequency bandwidth, temporal pattern, spatial location, etc.) of the signals to be recorded are highly heterogeneous, and it is difficult to make a priori assumptions about their nature. Consequently, sound recording has to be as extensive as possible and should cover a wide area. In addition, recording conditions are expected to be unfavourable in terms of signal-to-noise ratio, in particular because of the potentially large distance between sound sources and microphones, and because of acoustic reverberation. MEMS technology is widely used in this context. The use of smartphones and drones is also spreading.
Acoustic monitoring in buildings (office or home) raises other problems related to both privacy protection and aesthetic concerns. Placing microphones in the environment may seem incompatible with privacy protection, but this concern can be addressed by the way in which the collected audio data is used, in particular by ensuring that it is not transmitted externally, even when this data feeds a machine learning process. Federated learning is an example of such a strategy and is described in the next section. Furthermore, as it is not conceivable to fill the environment with microphones, a compromise must be found between the number of microphones and the size of the monitored areas. A solution was proposed, consisting of a one-eighth spherical fraction microphone array (SFMA), which can be placed in the upper corner of a room (Fig. 8) [173–175]. This setup has the advantage of being compact and discreet, which ensures aesthetic acceptability. It is also efficient in the sense that, under the assumption that the walls and the ceiling are perfectly rigid, an 8-microphone array is equivalent to a 64-microphone array, once the image microphones generated by the 3 surfaces forming the corner are added to complete the sphere. Such a spherical fraction microphone array enables soundfield analysis based on HOA and Spherical Harmonics (SH) [173], making it possible to extract a component of the sound scene by beamforming [175] and to localize sources.
Figure 8 Prototypes of one-eighth SFMA. From left to right: 8-MEMS microphone array, 16-MEMS microphone array, example of an installation in a room (from [175]).
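The equivalence between the 8 physical microphones and a virtual 64-microphone spherical array can be illustrated by generating the image microphones explicitly. The sketch below assumes the corner is the origin and that the two walls and the ceiling coincide with the planes x = 0, y = 0 and z = 0; the microphone positions used here are placeholders, not the actual geometry of [173–175].

```python
import numpy as np
from itertools import product

def image_microphones(mics):
    """Expand the physical microphones of a one-eighth SFMA into image microphones.

    mics: array of shape (n, 3), positions relative to the room corner.
    Under the rigid-surface assumption, mirroring each microphone across every
    combination of the three boundary planes yields 8 copies per microphone
    (including the original), i.e. a virtual 8*n-microphone array on the full sphere.
    """
    mics = np.asarray(mics, dtype=float)
    signs = np.array(list(product([1, -1], repeat=3)))   # the 8 sign combinations
    return np.concatenate([mics * s for s in signs])      # shape (8*n, 3)

# 8 physical MEMS microphones -> 64 virtual microphones
physical = np.random.rand(8, 3)         # placeholder positions on the 1/8 sphere
virtual = image_microphones(physical)   # virtual.shape == (64, 3)
```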
4.1.2 Machine listening
Automatic sound recognition emerged as a research field in 2013 with the first DCASE workshop (Detection and Classification of Acoustic Scenes and Events [4]), which was associated with a challenge illustrating the diversity of the questions raised by this domain: acoustic scene classification, anomalous sound detection for machine condition monitoring, sound event detection and localization, automatic audio captioning, bioacoustic event detection, etc. Nowadays, recognition models are essentially built on artificial neural networks. The DCASE challenge therefore also addresses issues specific to machine learning: low-complexity models, weak labelling, few-shot learning, data augmentation, energy consumption, etc. Compared with speech recognition, automatic sound recognition performance is fairly weak at the moment (e.g. 58.4% accuracy for the system ranked first in the task “Low-complexity acoustic scene classification” of the DCASE 2023 Challenge [4]), but is improving steadily. Sound recognition need not be limited to the audible bandwidth. A recent study of the infrasounds generated by human activities shows that a specific infrasonic signature can be identified for each event [176]. A database of infrasound recordings associated with various activities (opening a window or a door, walking through a door, etc.) or machines (washing machine, dishwasher, lawn mower, etc.) has been collected. Developing the associated recognition models is the next step.
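To give an idea of what a typical recognition model looks like, the following sketch defines a small convolutional classifier operating on log-mel spectrograms, in the spirit of acoustic scene classification baselines. It is a toy illustration (PyTorch, arbitrary layer sizes, random input), not a DCASE submission or the ranked system mentioned above.

```python
import torch
import torch.nn as nn

class SceneClassifier(nn.Module):
    """Small CNN classifying log-mel spectrograms into acoustic scene classes."""
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),   # global pooling keeps the model low-complexity
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x):              # x: (batch, 1, n_mels, n_frames)
        h = self.features(x).flatten(1)
        return self.classifier(h)      # class logits

# Example: a batch of 4 clips represented as 64-band log-mel spectrograms
logmel = torch.randn(4, 1, 64, 431)    # placeholder features
logits = SceneClassifier(n_classes=10)(logmel)
```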
However, one of the most critical questions of automatic sound recognition is privacy protection. Embedding recognition in devices, by quantizing neural network models and implementing the processing in nano computers, is a first part of the answer, since it eliminates the need to transmit audio data to distant servers. But, if the recognition model is to be customized to a given environment (e.g. a particular house with its inhabitants, their habits and its own soundscape), conventional machine learning strategies require personal audio data to be exported for a remote learning procedure, in which the personal database is merged into a larger, generic learning database for enhanced robustness. A promising alternative is federated learning [177], in which personal acoustic data are not transmitted outside the private environment: only the parameters of the individual recognition models are shared and combined to build a global model with optimized performance. These solutions (i.e. embedded recognition and federated learning) are implemented in the device described in [178, 179], which aims at classifying home sound events into 41 categories (e.g. bark, laugh, cough, key jangling, glass shatter, etc.) [180]. The appearance of this prototype is inspired by a cat, whose discretion and vigilance it evokes.
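The core of federated learning can be summarized by its aggregation step, sketched below as a FedAvg-style weighted average of locally trained parameters. This is a minimal illustration of the principle that only model parameters leave the home, not the raw audio; it is not the specific scheme deployed in [177–179].

```python
import numpy as np

def federated_average(local_weights, n_samples):
    """Aggregate locally trained models into a global model (FedAvg-style).

    local_weights: list of dicts {parameter_name: np.ndarray}, one per home
    device; only these parameters are transmitted, never the audio data.
    n_samples: number of local training examples per device, used to weight
    each contribution.
    """
    total = float(sum(n_samples))
    global_weights = {}
    for name in local_weights[0]:
        global_weights[name] = sum(
            w[name] * (n / total) for w, n in zip(local_weights, n_samples)
        )
    return global_weights
```

The resulting global model is then sent back to the devices, and the cycle of local training and parameter aggregation is repeated.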
4.2 Brain Computer Interfaces based on spatialized audio stimuli
Over the last few years, voice-operated devices have pervaded our daily lives, enabling us to control our environment in a way that is both intuitive and fast. Today, one might wonder whether the next step will be BCIs [6, 7], which aim to act on devices through the mind and which are no longer science fiction. Tools for monitoring brain activity, often based on EEG sensors, are now available outside neuroscience laboratories. The principle of a BCI is to observe and analyse users' brain activity, in order to decode their intention and then infer the expected action. The psychomotor system is thus bypassed, providing an even more intuitive way of interacting with devices: a kind of universal remote control.
For instance, in the case of a reactive BCI, the user is presented with a set of stimuli whose properties are varied. In the Steady State Visually Evoked Potential (SSVEP) paradigm, the stimuli may be, for instance, lights flashing at different frequencies, each of them associated with a specific command (e.g. switch on a lamp, switch on the television, etc.) [181]. Users have to focus on the stimulus corresponding to the command of their choice. Each flashing frequency is observed to result in a different activation of the brain. The challenge is to detect these differences in brain activity as a function of the stimulus, and to match the brain response to the proper stimulus. A calibration phase is generally needed to collect the brain responses to the different stimuli. This dataset may then be used to train a classification model that recognizes the brain activity pattern corresponding to each stimulus.
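A minimal way to match the brain response to the attended stimulus is to compare the EEG spectral power at each candidate flicker frequency (and a few harmonics). The sketch below illustrates this idea under simplifying assumptions (single channel, no calibration); practical systems typically rely on calibrated classifiers or canonical correlation analysis rather than such a simple detector.

```python
import numpy as np

def classify_ssvep(eeg, fs, stim_freqs, n_harmonics=2):
    """Match an EEG segment to the attended flicker frequency.

    eeg: 1-D array (e.g. an occipital channel) recorded while the user focuses
    on one of the flashing stimuli. For each candidate frequency, the spectral
    power at the fundamental and a few harmonics is summed; the command
    associated with the highest score is selected.
    """
    spectrum = np.abs(np.fft.rfft(eeg * np.hanning(len(eeg)))) ** 2
    freqs = np.fft.rfftfreq(len(eeg), d=1.0 / fs)
    scores = []
    for f0 in stim_freqs:
        score = 0.0
        for h in range(1, n_harmonics + 1):
            idx = np.argmin(np.abs(freqs - h * f0))   # nearest FFT bin
            score += spectrum[idx]
        scores.append(score)
    return stim_freqs[int(np.argmax(scores))]
```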
Most of the time, visual stimuli are used. Nevertheless, purely auditory BCIs have been identified as a promising tool [182], with several advantages. Firstly, auditory stimuli can be reproduced with simple and inexpensive equipment (e.g. loudspeakers or headphones). Secondly, headphones ensure confidentiality. Thirdly, since cognitive processing time is significantly shorter for audio stimuli than for visual ones, it can be assumed that the reactivity of auditory BCIs is higher than that of visual ones [183]. In addition, the auditory modality is compatible with visual deficiencies. Furthermore, spatial presentation (e.g. binaural synthesis) of auditory stimuli has been shown to improve classification performance [184], along with reaction time and overall accuracy [185]. Recently, the Steady State Auditory Evoked Potential paradigm (i.e. the auditory counterpart of the aforementioned paradigm) has been successfully tested [186]. Beyond improving their reliability, one of the next challenges for BCIs is to make them more user-friendly, for instance by using natural sounds instead of pure tones. Combining stimuli from several sensory modalities is another aspect to be investigated.
5 Conclusion
This overview of acoustic research in ICT over several decades has shown how rich and varied the developments have been, leading to a continuous broadening of themes and expertise, ranging from telephony in the original sense to neuroscience with BCIs. The future is equally rich in questions.
Immersive communication is striving for telepresence and virtual teleportation [187, 188], building on technologies such as eXtended Reality (XR) [189]. In this context, even though tools to capture and render a 3D sound scene are available, questions are emerging concerning the real-time processing of sound scenes, for instance to allow scene analysis (e.g. source localization or extraction), post-production (e.g. adding or suppressing sound components, modifying the room effect, etc.), or interactive rendering [1]. Mixing real and virtual sources requires, for instance, proper suppression of the room effect and the background noise of the environment in which the real sources are captured. Besides, offering the listener the possibility to move within the sound scene (and thus to harness active listening) with 6DoF (Degrees of Freedom) navigation is one of the major challenges of research in this field. Not only the recording, but also the reproduction and the description of the sound scene need to allow a full, interactive and fluid exploration. Furthermore, audiovisual reproduction of scenes including people and their environment provides a first level of immersion, but enriching this experience with other sensory modalities is the next step towards making VR a truly effective tool for digital twins, Web 3.0 or the Metaverse. Not only does this allow a more natural experience, but multimodal presentation of information is also advantageous in terms of communication, interaction, memory, learning processes, etc. There are many potential applications: telemedicine, distance education and training, team teleworking, remote maintenance operations, not forgetting the enhancement of sensory experiences in entertainment.
Like many other fields, audio processing makes increasing use of artificial neural networks and machine learning, for which large and representative audio datasets are needed. Automatic sound recognition is one example, but deep learning can also be used for denoising, dereverberation, source localization, audio codecs, etc. Audio data is the lifeblood of machine learning, and data augmentation is used to artificially increase the data volume, either by transforming existing data or by synthesizing new data. For instance, with VR tools (e.g. spatial sound synthesis, artificial reverberation), complex sound scenes can be created from isolated sounds recorded in anechoic conditions. The influence of interfering sources and ambient noise can be simulated. It is also possible to modify the audio quality, e.g. by introducing compression, bandwidth reduction or the specific transfer function of a recording system. These augmented data enrich the knowledge of deep learning models, and consequently improve their performance.
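As a concrete illustration of such augmentation, the sketch below derives an augmented copy of an anechoic recording by mixing in background noise at a chosen signal-to-noise ratio and, optionally, convolving with a room impulse response. The function and parameter names are illustrative assumptions, not an existing toolbox API.

```python
import numpy as np

def augment(clip, noise, snr_db=10.0, rir=None):
    """Create an augmented copy of an anechoic recording.

    clip: 1-D array with the clean signal; noise: 1-D background noise at
    least as long as clip; rir: optional room impulse response used to
    simulate reverberation.
    """
    sig_power = np.mean(clip ** 2)
    noise = noise[: len(clip)]
    noise_power = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(sig_power / (noise_power * 10 ** (snr_db / 10)))  # scale noise to target SNR
    out = clip + gain * noise
    if rir is not None:
        out = np.convolve(out, rir)[: len(clip)]                     # add artificial reverberation
    return out
```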
In smart environments, machine listening can be used to infer a wide range of information (e.g. event detection, activity recognition). Combining acoustic data with other sensory modalities (e.g. touch, smell) could help to resolve ambiguities. Besides, active listening, which consists in emitting acoustic waves and recording and analysing the waves reflected by the environment (in a way similar to bat echolocation), makes it possible to infer room geometry, structural components, etc. Coupled with deep learning, it could provide tools to optimize, in an agnostic way, the distribution of WiFi transmitters within a home, in terms of both signal quality and energy efficiency. This idea could similarly be applied to the layout of a loudspeaker system.
Finally, acoustics opens up prospects for addressing the energy challenge posed by the climate crisis. Solutions for sound energy harvesting are emerging as promising technologies [190], all the more so as there are many places on Earth with high noise levels (e.g. urban or industrial environments). Thermoacoustic engines, which are based on the interaction between acoustic and thermal waves, can be used to cool (e.g. refrigerators), heat (e.g. acoustic heat pumps) or generate power, with the advantages of reduced maintenance costs and the absence of greenhouse gas emissions [191]. This could be a way of cooling data centers (or any other telecom infrastructure), while at the same time using the heat to produce electricity. Stand-alone smart devices could also harvest their own energy from environmental noise. All this illustrates how acoustics research is more than ever full of solutions for addressing ICT questions.
Conflict of interest
The authors declare no conflict of interest.
Data availability statement
No new data were created or analysed in this study.
Acknowledgments
We thank C. Grégoire and C. Plapous for comments and fruitful discussion that greatly improved the manuscript.
References
- J. Skowronek, A. Raake, G.H. Berndtsson, O.S. Rummukainen, P. Usai, A.N.B. Gunkel, M. Johanson, E.A.P. Habets, L. Malfait, D. Lindero, A. Toet: Quality of experience in telemeetings and videoconferencing: a comprehensive survey. IEEE Access 10 (2022) 63885–63931. [CrossRef] [Google Scholar]
- M. Bunz, G. Meikle: The internet of things. Wiley, Hoboken, NJ, USA, 2017. [Google Scholar]
- Ericsson ConsumerLab: 10 Hot Consumer Trends 2030: The internet of senses, 2019. [Google Scholar]
- Detection and Classification of Acoustic Scenes and Events: https://dcase.community. Accessed November 27, 2023. [Google Scholar]
- X. Huang, J. Baker, R. Reddy: A historical perspective of speech recognition. Communications of the ACM 57, 1 (2014). [Google Scholar]
- M. Clerc, L. Bougrain, F. Lotte: Brain computer interfaces 1: foundations and methods. Wiley, 2016. [CrossRef] [Google Scholar]
- M. Clerc, L. Bougrain, F. Lotte: Brain computer interfaces 2: technologies and applications. Wiley, 2016. [CrossRef] [Google Scholar]
- N. Zacharov: Sensory evaluation of sound. Taylor & Francis Group, 2019. [Google Scholar]
- International Telecommunication Union (ITU-T): Study Group 12. https://www.itu.int/en/ITU-T/about/groups/Pages/sg12.aspx. Accessed November 27, 2023. [Google Scholar]
- European Telecommunication Standards Institute: Technical Committee Speech and Multimedia Transmission Quality. https://www.etsi.org/committee/stq. Accessed November 27, 2023. [Google Scholar]
- 3rd Generation Partnership Project: https://www.3gpp.org. Accessed November 27, 2023. [Google Scholar]
- J.L. Flanagan, D.A. Berkley, K.L. Shipley: A digital teleconferencing system with integrated modalities for human/machine communication: HuMaNet, in: Acoustics, Speech, and Signal Processing, IEEE International Conference on, IEEE Computer Society, 1991. [Google Scholar]
- H. Buchner, S. Spors, W. Kellermann, R. Rabenstein: Full-duplex communication systems using loudspeaker arrays and microphone arrays, in: Proceedings of IEEE International Conference on Multimedia and Expo, IEEE, 2002. [Google Scholar]
- F. Khalil, J.P. Jullien, A. Gilloire: Microphone array for sound pickup in teleconference systems. Journal of the Audio Engineering Society 42, 9 (1994) 691–700. [Google Scholar]
- W. Kellermann: Analysis and design of multirate systems for cancellation of acoustical echoes, in: ICASSP-88, International Conference on Acoustics, Speech, and Signal Processing, IEEE, 1988. [Google Scholar]
- A. Gilloire, M. Vetterli: Adaptive filtering in subbands with critical sampling: analysis, experiments, and application to acoustic echo cancellation. IEEE Transactions on Signal Processing 40, 8 (1992) 1862–1875. [CrossRef] [Google Scholar]
- M.J. Evans, A.I. Tew, J.A.S. Angus: Spatial audio teleconferencing – which way is better? ICAD, 1997. [Google Scholar]
- Recommendation ITU-T P.700: Calculation of loudness for speech communication. ITU-T, 2021. https://www.itu.int/rec/T-REC-P.700-202106-I/en. [Google Scholar]
- M. Wong, R. Duraiswami: Shared-space: spatial audio and video layouts for videoconferencing in a virtual room, in: Immersive and 3D Audio: from Architecture to Automotive (I3DA), 2021, pp. 1–6. https://doi.org/10.1109/I3DA48870.2021.961097. [Google Scholar]
- M. Miyoshi, N. Koizumi: NTT’s research on acoustics for future telecommunication services. Applied Acoustics 36 (1992) 307–326. [CrossRef] [Google Scholar]
- P. Cochrane, D. Heatley, K.H. Cameron: Telepresence-visual telecommunications into the next century, in: Fourth IEE Conference on Telecommunications, Manchester, UK, IEEE, 1993, pp. 175–180. [Google Scholar]
- A. Rimell: Immersive spatial audio for telepresence applications: system design and implementation, in: 16th AES International Conference: Spatial Sound Reproduction, Paper 16-033, AES, 1999. [Google Scholar]
- A. Raake, C. Schlegel, K. Hoeldtke, M. Geier, J. Ahrens: Listening and conversational quality of spatial audio conferencing, in: 40th International AES Conference: Spatial Audio: Sense the Sound of Space, AES, 2010. [Google Scholar]
- A.J. Berkhout, D. de Vries, P. Vogel: Acoustic control by wave field synthesis. Journal of the Acoustical Society of America 93, 5 (1993) 2764–2778. [CrossRef] [Google Scholar]
- R. Nicol, M. Emerit: 3D-sound reproduction over an extensive listening area: a hybrid method derived from holophony and ambisonic, in: 16th AES International Conference: Spatial Sound Reproduction, Paper 16-039, AES, 1999. [Google Scholar]
- T. Ziemer: Wave field synthesis, in: Psychoacoustic Music Sound Field Synthesis, Current Research in Systematic Musicology, vol. 7, Springer, 2020. https://doi.org/10.1007/978-3-030-23033-3_8. [CrossRef] [Google Scholar]
- M.A. Gerzon: Periphony: with-height sound reproduction. Journal of the Audio Engineering Society 21, 1 (1973) 2–10. [Google Scholar]
- J.S. Bamford: An analysis of ambisonic sound systems of first and second order. M.Sc. thesis, University of Waterloo, 1995. [Google Scholar]
- J. Daniel, S. Moreau, R. Nicol: Further investigations of high-order ambisonics and wavefield synthesis for holophonic sound imaging, in: 114th AES Convention, Paper 5788, AES, 2003. [Google Scholar]
- V. Pulkki: Virtual sound source positioning using vector base amplitude panning. Journal of the Audio Engineering Society 45, 6 (1997) 456–466. [Google Scholar]
- H. Møller: Fundamentals of binaural technology. Applied Acoustics 36, 3–4 (1992) 171–218. [CrossRef] [Google Scholar]
- V. Larcher: Techniques de spatialisation des sons pour la réalité virtuelle. Ph.D. thesis, University of Paris 6, 2001. [Google Scholar]
- R. Nicol: Binaural technology. AES Monograph, 2010. [Google Scholar]
- A. Roginska, P. Geluso: Immersive sound: the art and science of binaural and multi-channel audio, 1st ed., Routledge, 2017. https://doi.org/10.4324/9781315707525. [Google Scholar]
- J. Blauert: Spatial hearing: the psychophysics of human sound localization. The MIT Press, 1996. https://doi.org/10.7551/mitpress/6391.001.0001. [Google Scholar]
- J. Daniel: Spatial sound encoding including near field effect: Introducing distance coding filters and a viable new Ambisonic format, in: AES 23rd International Conference, AES, 2003. [Google Scholar]
- F. Olivieri, N. Peters, D. Sen: Scene-based audio and higher order ambisonics: a technology review and application to next-generation audio, vr and 360° video, EBU Technical Review, 2018. [Google Scholar]
- J. Daniel: Représentation de champs acoustiques, application à la transmission et à la restitution de scènes sonores complexes dans un contexte multimédia. Ph.D. thesis, University of Paris 6, 2000. [Google Scholar]
- P. Lecomte, P.A. Gauthier, A. Berry, A. Garcia, C. Langrenne: Directional filtering of Ambisonic sound scenes, in: AES International Conference on Spatial Reproduction – Aesthetics and Science, AES, 2018. [Google Scholar]
- P. Lecomte, P.A. Gauthier, C. Langrenne, A. Berry, A. Garcia: Cancellation of room reflections over an extended area using Ambisonics. Journal of the Acoustical Society of America 143 (2018) 811–828. [CrossRef] [PubMed] [Google Scholar]
- G. Theile: Multichannel natural recording based on psychoacoustic principles, in: AES 108th Convention, Preprint 5156, AES, Paris, 2000. [Google Scholar]
- Soundferences organized by the Society Tregor Sonore: https://tregorsonore.fr/index.php/sonferences-du-tregor/. Accessed November 27, 2023. [Google Scholar]
- P.G. Craven, M.A. Gerzon, US Patent, 4042779, 1977. [Google Scholar]
- B. Rafaely: Analysis and design of spherical microphone arrays. IEEE Transactions on Speech and Audio Processing 13, 1 (2005) 135–143. [Google Scholar]
- D.P. Jarrett, E.A.P. Habets, P.A. Naylor: Theory and applications of spherical microphone array processing, in: Topics in Signal Processing, Springer, 2017. [Google Scholar]
- B. Rafaely: Fundamentals of spherical array processing, in: Springer Topics in Signal Processing, Springer, 2019. [CrossRef] [Google Scholar]
- S. Moreau, J. Daniel, S. Bertet: 3D sound field recording with Higher Order Ambisonics – Objective measurements and validation of spherical microphone, in: AES 120th Convention, Paper 6857, AES, 2006. [Google Scholar]
- F. Zotter, M. Frank: Higher-order ambisonic microphones and the wave equation (linear, lossless), in: Ambisonics. Springer Topics in Signal Processing, vol. 19, Springer, Cham, 2019. [CrossRef] [Google Scholar]
- N. Epain, J. Daniel: Improving spherical microphone arrays, in: AES 124th Convention, Paper 7479, 2008. [Google Scholar]
- J. Palacino, R. Nicol: Spatial sound pick-up with a low number of microphones. ICA, 2013. [Google Scholar]
- M.-V. Laitinen, L. Laaksonen, J. Vilkamo: Spatial audio representation and rendering. Patent EP 3757992, 2020. [Google Scholar]
- Diapason: Rennes Opera goes 3D for Don Giovanni, L’Opéra de Rennes se met à la 3D pour Don Giovanni (in French), 2009. https://www.diapasonmag.fr/a-laune/lopera-de-rennes-se-met-a-la-3d-pour-don-giovanni-12989.html. Accessed November 27, 2023. [Google Scholar]
- mh acoustics LLC: https://mhacoustics.com. Accessed November 27, 2023. [Google Scholar]
- Zylia: https://www.zylia.co. Accessed November 27, 2023. [Google Scholar]
- A. Farina, L. Tronchin: 3D sound characterization in theatres employing microphone arrays. Acta Acustica united with Acustica 99 (2013) 118–125. [CrossRef] [Google Scholar]
- P. Massé: Analysis, treatment, and manipulation methods for spatial room impulse responses measured with spherical microphone arrays. Ph.D. thesis, Sorbonne Université, 2019. [Google Scholar]
- J. Daniel, S. Kitic: Echo-enabled direction-of-arrival and range estimation of a mobile source in ambisonic domain, in: 2022 30th European Signal Processing Conference (EUSIPCO), Belgrade, Serbia, IEEE, 2022, pp. 852–856. https://doi.org/10.23919/EUSIPCO55093.2022.9909743. [CrossRef] [Google Scholar]
- J. Blauert (Ed.), The technology of binaural listening. Springer, 2020. https://doi.org/10.1007/978-3-642-37762-4. [Google Scholar]
- D.R. Begault, E.M. Wenzel, M.R. Anderson: Direct comparison of the impact of head tracking, reverberation, and individualized head-related transfer functions on the spatial perception of a virtual speech source. Journal of the Audio Engineering Society 49 (2001) 904–916. [Google Scholar]
- E. Hendrickx, P. Stitt, J.-C. Messonnier, J.-M. Lyzwa, B.F.G. Katz, C. de Boishéraud: Influence of head tracking on the externalization of speech stimuli for non-individualized binaural synthesis. Journal of the Acoustical Society of America 141, 3 (2017) 2011–2023. [CrossRef] [PubMed] [Google Scholar]
- H. Møller, M.F. Sørensen, D. Hammershøi, C.B. Jensen: Head related transfer functions of human subjects. Journal of the Audio Engineering Society 43, 5 (1995) 300–321. [Google Scholar]
- V.R. Algazi, R.O. Duda, D.P. Thompson, C. Avendano: The CIPIC HRTF database, in: Proceedings of the 2001 IEEE Workshop on the Applications of Signal Processing to Audio and Acoustics, IEEE, 2001. [Google Scholar]
- J.M. Pernaux, M. Emerit, J. Daniel, R. Nicol: Perceptual evaluation of static binaural sound, in: 22nd AES International Conference: Virtual, Synthetic, and Entertainment Audio, AES, 2002. [Google Scholar]
- LISTEN HRTF database: http://recherche.ircam.fr/equipes/salles/listen/. Accessed November 27, 2023. [Google Scholar]
- ARI HRTF database: https://www.oeaw.ac.at/isf/das-institut/software/hrtf-database. Accessed November 27, 2023. [Google Scholar]
- FABIAN HRTF database: https://depositonce.tu-berlin.de/items/bff6568a-5735-4ebc-b3fa-ac10707b7beb. Accessed November 27, 2023. [Google Scholar]
- N. Gupta, A. Barreto, M. Joshi, J.C. Agudelo: HRTF database at FIU DSP Lab, in: 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2010, pp. 169–172. https://doi.org/10.1109/ICASSP.2010.5496084. [CrossRef] [Google Scholar]
- K. Watanabe, Y. Iwaya, Y. Suzuki, S. Takane, S. Sato: Dataset of head-related transfer functions measured with a circular loudspeaker array. Acoustical Science and Technology 35, 3 (2014) 159–165. [CrossRef] [Google Scholar]
- C.T. Jin, P. Guillon, N. Epain, R. Zolfaghari, A. van Schaik, A.I. Tew, C. Hetherington, J. Thorpe: Creating the Sydney York morphological and acoustic recordings of ears database. IEEE Transactions on Multimedia 16, 1 (2014) 37–46. [CrossRef] [Google Scholar]
- ITA HRTF database: https://www.akustik.rwth-aachen.de/go/id/lsly. Accessed November 27, 2023. [Google Scholar]
- F. Brinkmann, M. Dinakaran, R. Pelzer, P. Grosche, D. Voss, S. Weinzierl: A cross-evaluated database of measured and simulated HRTFs including 3D head meshes, anthropometric features, and headphone impulse responses. Journal of the Audio Engineering Society 67, 9 (2019) 705–719. [CrossRef] [Google Scholar]
- I. Engel, R. Daugintis, T. Vicente, A.O.T. Hogg, J. Pauwels, A.J. Tournier, L. Picinali: The SONICOM HRTF dataset. Journal of the Audio Engineering Society 71, 5 (2023) 241–253. [CrossRef] [Google Scholar]
- P. Minnaar, J. Plogsties, C. Flemming: Directional resolution of head-related transfer functions required in binaural synthesis. Journal of the Audio Engineering Society 53, 10 (2005) 919–929. [Google Scholar]
- S. Carlile, C. Jin, V. van Raad: Continuous virtual auditory space using HRTF interpolation: Acoustic and psychophysical errors, in: Proceedings of the First IEEE Pacific-Rim Conference on Multimedia, IEEE, 2000, pp. 220–223. [Google Scholar]
- R. Martin, K. McAnally: Interpolation of head-related transfer functions. Technical Report DSTO-RR-0323, Australian Government – Department of Defence, 2007. [Google Scholar]
- BiLi Project (in French): https://www.espace-sciences.org/sciences-ouest/310/dossier/immersion-dans-le-son. Accessed November 27, 2023. [Google Scholar]
- T. Carpentier, H. Bahu, M. Noisternig, O. Warusfel: Measurement of a head-related transfer function database with high spatial resolution, in: 7th Forum Acusticum, Krakow, Poland, EAA, 2014. [Google Scholar]
- F. Rugeles Ospina: Individualisation de l’écoute binaurale: création et transformation des indices spectraux et des morphologies des individus. Ph.D. thesis, University of Paris 6, 2016. [Google Scholar]
- F. Rugeles Ospina, M. Emerit, B.F.G. Katz: The three-dimensional morphological database for spatial hearing research of the BiLi project, in: Proc. of Meetings on Acoustics, Acoustical Society of America (ASA), 2015. [Google Scholar]
- P. Majdak, F. Zotter, F. Brinkmann, J. De Muynke, M. Mihocic, M. Noisternig: Spatially oriented format for acoustics 2.1: Introduction and recent advances, Journal of the Audio Engineering Society 70, 7/8 (2022) 565–584. [CrossRef] [Google Scholar]
- P. Majdak, Y. Iwaya, T. Carpentier, R. Nicol, M. Parmentier, A. Roginska, Y. Suzuki, K. Watanabe, H. Wierstorf, H. Ziegelwanger, M. Noisternig: Spatially oriented format for acoustics: a data exchange format representing head-related transfer functions, in: AES 134th Convention, AES, 2013. [Google Scholar]
- SOFA (Spatially Oriented Format for Acoustics): https://www.sofaconventions.org/mediawiki/index.php/SOFA_(Spatially_Oriented_Format_for_Acoustics). Accessed November 27, 2023. [Google Scholar]
- D.N. Zotkin, R. Duraiswami, E. Grassi, N.A. Gumerov: Fast head-related transfer function measurement via reciprocity. Journal of the Acoustical Society of America 120, 4 (2006) 2202–2215. [CrossRef] [PubMed] [Google Scholar]
- G. Enzner: 3D-continuous-azimuth acquisition of head-related impulse responses using multi-channel adaptive filtering, in: 2009 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, IEEE, 2009, pp. 325–328. [CrossRef] [Google Scholar]
- M. Pollow, B. Masiero, P. Dietrich, J. Fels, M. Vorländer: Fast measurement system for spatially continuous individual HRTFs, in: 4th Int. Symposium on Ambisonics and Spherical Acoustics, 25th AES UK Conference, AES, University of York, UK, 2012. [Google Scholar]
- P. Majdak, P. Balazs, B. Laback: Multiple exponential sweep method for fast measurement of head-related transfer functions. Journal of the Audio Engineering Society 55, 7/8 (2007) 623–637. [Google Scholar]
- J. Richter, G. Behler, J. Fels: Evaluation of a fast HRTF measurement system, in: 140th International AES Convention, France, Paris, AES, 2016. [Google Scholar]
- S. Busson, R. Nicol, V. Choqueuse, V. Lemaire: Non-linear interpolation of head related transfer function. CFA, 2006. [Google Scholar]
- P. Guillon, R. Nicol, L. Simon: Head-Related Transfer Functions reconstruction from sparse measurements considering a priori knowledge from database analysis: a pattern recognition approach, in: AES 125th Convention, Paper 7610, AES, 2008. [Google Scholar]
- B.-S. Xie: Recovery of individual head-related transfer functions from a small set of measurements. Journal of the Acoustical Society of America 132, 1 (2012) 282–294. [CrossRef] [PubMed] [Google Scholar]
- M. Maazaoui, O. Warusfel: Estimation of individualized HRTF in unsupervised conditions, in: 140th International AES Convention, AES, 2016. [Google Scholar]
- A. Moreau, O. Warusfel: Identification de HRTFs individuelles par selfies binauraux et apprentissage machine. CFA, 2022. [Google Scholar]
- E.M. Wenzel, M. Arruda, D.J. Kistler, F.L. Wightman: Localization using nonindividualized head-related transfer functions. Journal of the Acoustical Society of America 94, 1 (1993) 111–123. [CrossRef] [PubMed] [Google Scholar]
- P.M. Hofman, J.G. Van Riswick, A.J. Van Opstal: Relearning sound localization with new ears. Nature neuroscience 1, 5 (1998) 417–421. [CrossRef] [PubMed] [Google Scholar]
- D. Poirier-Quinot, B.F.G. Katz: On the improvement of accomodation to non-individual HRTFs via VR active learning and inclusion of a 3D room response. Acta Acustica 5 (2021) 25. [CrossRef] [EDP Sciences] [Google Scholar]
- F.L. Wightman, D.J. Kistler: Headphone simulation of free-field listening. II: Psychophysical validation. Journal of the Acoustical Society of America 85, 2 (1989) 868–878. [CrossRef] [PubMed] [Google Scholar]
- T.D. Mrsic-Flogel, A.J. King, R.L. Jenison, J.W. Schnupp: Listening through different ears alters spatial response fields in ferret primary auditory cortex. Journal of Neurophysiology 86 (2001) 1043–1046. [CrossRef] [PubMed] [Google Scholar]
- J.C. Middlebrooks: Virtual localization improved by scaling nonindividualized external-ear transfer functions in frequency. Journal of the Acoustical Society of America 106, 3 (1999) 1493–1510. [CrossRef] [PubMed] [Google Scholar]
- C.T. Jin, P. Leong, J. Leung, A. Corderoy, S. Carlile: Enabling individualized virtual auditory space using morphological measurements, in: Proceedings of the First IEEE Pacific-Rim Conference on Multimedia, Citeseer, 2000. [Google Scholar]
- B.F.G. Katz: Boundary element method calculation of individual head-related transfer function. I. Rigid model calculation. Journal of the Acoustical Society of America 110, 5 (2001) 2440–2448. [CrossRef] [PubMed] [Google Scholar]
- V.R. Algazi, R.O. Duda, R. Duraiswami, N.A. Gumerov, Z. Tang: Approximating the head-related transfer function using simple geometric models of the head and torso. Journal of the Acoustical Society of America 112, 5 (2002) 2053–2064. [CrossRef] [PubMed] [Google Scholar]
- D.N. Zotkin, J. Hwang, R. Duraiswami, L.S. Davis: HRTF personalization using anthropometric measurements, in: 2003 IEEE workshop on applications of signal processing to audio and acoustics, IEEE, 2003. [Google Scholar]
- S. Hwang, Y. Park, Y. Park: Modeling and customization of head related impulse responses based on general basis functions in time domain. Acta Acustica United with Acustica 94, 6 (2008) 965–980. [CrossRef] [Google Scholar]
- S. Hwang, Y. Park: Interpretations on principal components analysis of head-related impulse responses in the median plane. Journal of the Acoustical Society of America 123, 4 (2008) EL65–EL71. [CrossRef] [PubMed] [Google Scholar]
- M. Dellepiane, N. Pietroni, N. Tsingos, M. Asselot, R. Scopigno: Reconstructing head models from photographs for individualized 3D-audio processing, in: Computer Graphics Forum, Blackwell Publishing Ltd., Oxford, UK, 2008, pp. 1719–1727. [CrossRef] [Google Scholar]
- S. Xu, Z. Li, G. Salvendy: Individualized head-related transfer functions based on population grouping. Journal of the Acoustical Society of America 124, 5 (2008) 2708–2710. [CrossRef] [PubMed] [Google Scholar]
- A. Lindau, J. Estrella, S. Weinzierl: Individualization of dynamic binaural synthesis by real time manipulation of ITD, in: 128th Audio Engineering Society Convention, AES, 2010. [Google Scholar]
- K. Iida, Y. Ishii, S. Nishioka: Personalization of head-related transfer functions in the median plane based on the anthropometry of the listener’s pinnae. Journal of the Acoustical Society of America 136, 1 (2014) 317–333. [CrossRef] [PubMed] [Google Scholar]
- K.J. Fink, L. Ray: Individualization of head related transfer functions using principal component analysis. Applied Acoustics 87 (2015) 162–173. [CrossRef] [Google Scholar]
- R. Bomhardt, M. Lins, J. Fels: Analytical ellipsoidal model of interaural time differences for the individualization of head-related impulse responses. Journal of the Audio Engineering Society 64, 11 (2016) 882–894. [CrossRef] [Google Scholar]
- R. Nicol, M. Emerit, L. Gros: HRTF “prêt-à-porter” pour le son binaural dans les futurs contenus d’Orange. CFA, 2018. [Google Scholar]
- B.F.G. Katz, G. Parseihian: Perceptually based head-related transfer function database optimization. Journal of the Acoustical Society of America 131 (2012) EL99–EL105. [CrossRef] [PubMed] [Google Scholar]
- P.Y. Michaud, R. Nicol: Multi dimensional scaling of perceived dissimilarities between non-individual HRTFs: investigating the perceptual space of binaural synthesis. BiLi Project Deliverable, 2015. [Google Scholar]
- P. Guillon, T. Guignard, R. Nicol: Head-related transfer function customization by frequency scaling and rotation shift based on a new morphological matching method, in: 125th AES Convention, Paper 7550, AES, 2008. [Google Scholar]
- M. Emerit, F. Rugeles Ospina, R. Nicol: Transformer un jeu de HRTF en un autre à partir de données morphologiques. CFA – VISHNO, 2016. [Google Scholar]
- Y. Kahana, P.A. Nelson: Boundary element simulations of the transfer function of human heads and baffled pinnae using accurate geometric models. Journal of Sound and Vibration 300, 3–5 (2007) 552–579. [CrossRef] [Google Scholar]
- M. Pollow, K.-V. Nguyen, O. Warusfel, T. Carpentier, M. Müller-Trapet, M. Vorländer, M. Noisternig: Calculation of head-related transfer functions for arbitrary field points using spherical harmonics decomposition. Acta Acustica united with Acustica 98, 1 (2012) 72–82. [CrossRef] [Google Scholar]
- D.J. Kistler, F.L. Wightman: A model of head-related transfer functions based on principal components analysis and minimum-phase reconstruction. Journal of the Acoustical Society of America 91, 3 (1992). [Google Scholar]
- J. Blauert, J. Braasch (Eds.): The technology of binaural understanding. Springer, 2020. https://doi.org/10.1007/978-3-030-00386-9. [CrossRef] [Google Scholar]
- R. Nicol, M. Emerit: Reproducing 3D-sound for videoconferencing: a comparison between holophony and ambisonic. D.A.F.X., 1998. [Google Scholar]
- J.-M. Jot, V. Larcher, J.-M. Pernaux: A comparative study of 3-D audio encoding and rendering techniques, in: 16th AES International Conference: Spatial Sound Reproduction, Paper 16-025, AES, 1999. [Google Scholar]
- M. Relieu: La téléprésence, ou l’autre visiophonie. Réseaux 5, 144 (2007) 183–223. [Google Scholar]
- S. Brix, T. Sporer, J. Plogsties: CARROUSO – an European approach to 3D-audio, in: 110th AES Convention, Paper 5314, AES, 2001. [Google Scholar]
- R. Väänänen, O. Warusfel, M. Emerit: Encoding and rendering of perceptual sound scenes in the CARROUSO project, in: 22nd International AES Conference: Virtual, Synthetic, and Entertainment Audio, AES, 2002. [Google Scholar]
- E. Corteel, U. Horbach, R.S. Pellegrini: Multichannel inverse filtering of multiexciter distributed mode loudspeaker for wave field synthesis, in: 112th AES Convention, Paper 5611, AES, 2002. [Google Scholar]
- BlueJeans: https://www.bluejeans.com/. Accessed November 27, 2023. [Google Scholar]
- BT MeetMe with Dolby Voice: www.btconferencing.com/meetme-with-dolby-voice/meetme-with-dolby-voice_en.pdf. Accessed November 27, 2023. [Google Scholar]
- Dolby Voice: https://docs.dolby.io/communications-apis/docs/guides-dolby-voice. Accessed November 27, 2023. [Google Scholar]
- Cisco IX5000 Series: https://www.cisco.com/c/en/us/products/collateral/collaboration-endpoints/ix5000-series/datasheet-c78-733257.html. Accessed November 27, 2023. [Google Scholar]
- F. Rumsey: Spatial audio processing: Upmix, downmix, shake it all about. Journal of the Audio Engineering Society 61, 6 (2013) 474–478. [Google Scholar]
- W.H. Nam, T. Lee, S.C. Ko, Y. Son, H.K. Chung, K.-R. Kim, J. Kim, S. Hwang, K. Lee: AI 3D immersive audio codec based on content-adaptive dynamic down-mixing and up-mixing framework, in: 151st AES Convention, Paper 10525, AES, 2021. [Google Scholar]
- G. Lorho, N. Zacharov: Subjective evaluation of virtual home theater sound systems for loudspeakers and headphones, in: 116th AES Convention, Paper 6141, AES, 2004. [Google Scholar]
- C. Pike, F. Melchior: An assessment of virtual surround sound systems for headphone listening of 5.1 multichannel audio, in: 134th AES Convention, Paper 8819, AES, 2013. [Google Scholar]
- H. Møller, C.B. Jensen, D. Hammershøi, M.F. Sørensen: Design criteria for headphones. Journal of the Audio Engineering Society 43, 4 (1995) 218–232. [Google Scholar]
- P. Rueff, R. Nicol, J. Palacino: Characterization of a wide selection of headphones for binaural reproduction: measurement of electro-acoustic, magnetic and ergonomics features. BiLi Project Deliverable, 2015. [Google Scholar]
- F. Baumgarte, C. Faller: Binaural cue coding–part I: psychoacoustic fundamentals and design principles. IEEE Transactions on Speech and Audio Processing 11, 6 (2003) 509–519. [CrossRef] [Google Scholar]
- C. Faller, F. Baumgarte: Binaural cue coding–part II: schemes and applications. IEEE Transactions on Speech and Audio Processing 11, 6 (2003) 520–531. [CrossRef] [Google Scholar]
- M.A. Gerzon: Ambisonics in Multichannel Broadcasting and Video. Journal of the Audio Engineering Society 33, 11 (1985) 859–871. [Google Scholar]
- A. Daniel: Spatial auditory blurring and applications to multichannel audio coding. Ph.D. thesis, University of Paris 6, 2011. [Google Scholar]
- Standard ISO/IEC 23008-3:2019: Information Technology – High Efficiency Coding and Media Delivery in Heterogeneous Environments – Part 3: 3D Audio, 2019. [Google Scholar]
- S.R. Quackenbush, J. Herre: MPEG standards for compressed representation of immersive audio. Proceedings of the IEEE 109, 9 (2021) 1578–1589. [CrossRef] [Google Scholar]
- IVAS: https://www.3gpp.org/technologies/ivas-highlights. Accessed November 27, 2023. [Google Scholar]
- ITU-R BS.1116-3: Methods for the subjective assessment of small impairments in audio systems, Technical Report, 2015. [Google Scholar]
- ITU-R BS.1284-2: General methods for the subjective assessment of sound quality, Technical Report 2019. [Google Scholar]
- R. Nicol, L. Gros, C. Colomes, M. Noisternig, O. Warusfel, H. Bahu, B.F.G. Katz, L.S.R. Simon: A roadmap for assessing the quality of experience of 3D audio binaural rendering, in: EAA Joint Symposium on Auralization and Ambisonics, EAA, 2014. [Google Scholar]
- J.M. Pernaux, M. Emerit, R. Nicol: Perceptual evaluation of binaural sound synthesis: the problem of reporting localization judgments, in: 114th AES Convention, Paper 5789, AES, 2003. [Google Scholar]
- H. Bahu, T. Carpentier, M. Noisternig, O. Warusfel: Comparison of different egocentric pointing methods for 3D sound localization experiments. Acta Acustica united with Acustica 102, 1 (2016) 107–118. [CrossRef] [Google Scholar]
- P. Guillon: Individualisation des indices spectraux pour la synthèse binaurale: recherche et exploitation des similarités inter-individuelles pour l’adaptation ou la reconstruction de HRTF. Ph.D. thesis, Le Mans Université, 2009. [Google Scholar]
- D. Poirier-Quinot, B.F.G. Katz: Assessing the impact of Head-Related Transfer Function individualization on task performance: case of a virtual reality shooter game. Journal of the Audio Engineering Society 68, 4 (2020) 248–260. [CrossRef] [Google Scholar]
- S. Agrawal, A. Simon, S. Bech, K. Bærentsen, S. Forchhammer: Defining immersion: literature review for research on audiovisual experiences. Journal of the Audio Engineering Society 68, 6 (2020) 404–417. [CrossRef] [Google Scholar]
- R. Nicol, O. Dufor, L. Gros, P. Rueff, N. Farrugia: EEG measurement of binaural sound immersion, in: EAA Spatial Audio Signal Processing Symposium, EAA, 2019. [Google Scholar]
- E. Hendrickx, M. Paquier, V. Koehl: Audiovisual spatial coherence for 2D and stereoscopic-3D movies. Journal of the Audio Engineering Society 63, 11 (2015) 889–899. [CrossRef] [Google Scholar]
- J. Moreira, L. Gros, R. Nicol, I. Viaud-Delmon: Spatial auditory-visual integration: the case of binaural sound on a smartphone, in: 145th AES Convention, Paper 10130, AES, 2018. [Google Scholar]
- S. Moulin, R. Nicol, L. Gros, P. Mamassian: Audio-visual spatial integration in distance dimension - when wave field synthesis meets stereoscopic-3D, in: 55th AES International Conference: Spatial Audio, AES, 2014. [Google Scholar]
- I.P. Howard, W.B. Templeton: Human spatial orientation. John Wiley & Sons, 1966. [Google Scholar]
- N. Côté, V. Koehl, M. Paquier: Ventriloquism on distance auditory cues, in: Acoustics 2012 Joint Congress, SFA and IOA, 2012. [Google Scholar]
- S. Moulin, R. Nicol, L. Gros: Auditory distance perception in real and virtual environments, in: Proceedings of the ACM Symposium on Applied Perception (SAP ‘13), Association for Computing Machinery (ACM), 2013. https://doi.org/10.1145/2492494.2501876. [Google Scholar]
- P. Zahorik: Asymmetric visual capture of virtual sound sources in the distance dimension. Frontiers in Neuroscience 16 (2022) 958577. [CrossRef] [PubMed] [Google Scholar]
- E. Hendrickx, M. Paquier, V. Koehl, J. Palacino: Ventriloquism effect with sound stimuli varying in both azimuth and elevation. Journal of the Acoustical Society of America 138 (2015) 3686–3697. [CrossRef] [PubMed] [Google Scholar]
- M. Rébillat, X. Boutillon, É. Corteel, B.F. Katz: Audio, visual, and audio-visual egocentric distance perception by moving subjects in virtual environments. ACM Transactions on Applied Perception (TAP) 9, 4 (2012) 1–17. [CrossRef] [Google Scholar]
- J. Blascovich, J. Loomis, A.C. Beall, K.R. Swinth, C.L. Hoyt, J.N. Bailenson: Immersive virtual environment technology as a methodological tool for social psychology. Psychological Inquiry 13, 2 (2002) 103–124. [CrossRef] [Google Scholar]
- G. Keidser, G. Naylor, D.S. Brungart, A. Caduff, J. Campos, S. Carlile, M.G. Carpenter, G. Grimm, V. Hohmann, I. Holube, S. Launer, T. Lunner, R. Mehra, F. Rapport, M. Slaney, K. Smeds: The quest for ecological validity in hearing science: what it is, why it matters, and how to advance it. Ear and Hearing 41, Suppl. 1 (2020) 5S–19S. [CrossRef] [PubMed] [Google Scholar]
- R. Larson, M. Csikszentmihalyi: The experience sampling method, in: Flow and the foundations of positive psychology, Springer, 2014. [Google Scholar]
- J. Moreira: Evaluer l’apport du binaural dans une application mobile audiovisuelle. Ph.D. thesis, CNAM, 2019. [Google Scholar]
- T. Robotham, O.S. Rummukainen, M. Kurz, M. Eckert, E.A.P. Habets: Comparing direct and indirect methods of audio quality evaluation in virtual reality scenes of varying complexity. IEEE Transactions on Visualization and Computer Graphics 28, 5 (2022) 2091–2101. [CrossRef] [PubMed] [Google Scholar]
- L. Turchet, M. Lagrange, C. Rottondi, G. Fazekas, N. Peters, J. Østergaard, F. Font, T. Bäckström, C. Fischione: The internet of sounds: convergent trends, insights, and future directions. IEEE Internet of Things Journal 10, 13 (2023) 11264–11292. [CrossRef] [Google Scholar]
- BirdNET: https://birdnet.cornell.edu. Accessed November 27, 2023. [Google Scholar]
- C.M. Wood, S. Kahl, P. Chaon, M.Z. Peery, H. Klinck: Survey coverage, recording duration and community composition affect observed species richness in passive acoustic surveys. Methods in Ecology and Evolution 12, 5 (2021) 885–896. [CrossRef] [Google Scholar]
- S. Kahl, C.M. Wood, M. Eibl, H. Klinck: BirdNET: a deep learning solution for avian diversity monitoring. Ecological Informatics 61 (2021) 101236. [CrossRef] [Google Scholar]
- BUGG: https://www.bugg.xyz. Accessed November 27, 2023. [Google Scholar]
- S.S. Sethi, N.S. Jones, B.D. Fulcher, L. Picinali, D.J. Clink, H. Klinck, C.D.L. Orme, P.H. Wrege, R.M. Ewers: Characterizing soundscapes across diverse ecosystems using a universal acoustic feature set. PNAS 117, 29 (2020) 17049–17055. [CrossRef] [PubMed] [Google Scholar]
- S.S. Sethi, R.M. Ewers, N.S. Jones, A. Signorelli, L. Picinali, C.D.L. Orme: SAFE Acoustics: an open-source, real-time eco-acoustic monitoring network in the tropical rainforests of Borneo. Methods in Ecology and Evolution 11 (2020) 1182–1185. [CrossRef] [Google Scholar]
- P. Lecomte, M. Melon, L. Simon: Spherical fraction beamforming. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28 (2020) 2996–3009. https://doi.org/10.1109/TASLP.2020.3034516. [CrossRef] [Google Scholar]
- P. Lecomte, T. Blanchard, M. Melon, L. Simon, K. Hassan, R. Nicol: One eighth of a sphere microphone array, in: Forum Acusticum, Lyon, France, EAA, 2020, pp. 313–318. [Google Scholar]
- T. Blanchard, P. Lecomte, M. Melon, L. Simon, K. Hassan, R. Nicol: Experimental acoustic scene analysis using One-Eighth spherical fraction microphone array. Journal of the Acoustical Society of America 151, 1 (2022) 180–192. [CrossRef] [PubMed] [Google Scholar]
- R. Nicol, C. Plapous, L. Avenel, T. Le Du: Recording and analyzing infrasounds to monitor human activities in buildings, in: Forum Acusticum, Torino, Italy, EAA, 2023. [Google Scholar]
- T. Li, A.K. Sahu, A. Talwalkar, V. Smith: Federated learning: challenges, methods, and future directions. IEEE Signal Processing Magazine 37, 3 (2020) 50–60. [Google Scholar]
- A machine that lends an ear: https://hellofuture.orange.com/en/a-machine-that-lends-an-ear/. Accessed November 27, 2023. [Google Scholar]
- L. Delphin-Poulat, C. Plapous: Mean teacher with data augmentation for DCASE 2019 Task 4. Technical Report, DCASE Challenge, 2019. [Google Scholar]
- J.F. Gemmeke, D.P.W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R.C. Moore, M. Plakal, M. Ritter: Audio set: an ontology and human-labeled dataset for audio events, in: 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), New Orleans, LA, USA, IEEE, 2017, pp. 776–780. https://doi.org/10.1109/ICASSP.2017.7952261. [CrossRef] [Google Scholar]
- H. Si-Mohammed, C. Haumont, A. Sanchez, C. Plapous, F. Bouchnak, J.-P. Javaudin, A. Lécuyer: Designing functional prototypes combining BCI and AR for home automation, in: Virtual Reality and Mixed Reality, EuroXR, Springer, Cham, 2022. https://doi.org/10.1007/978-3-031-16234-3_1. [Google Scholar]
- M. Schreuder, T. Rost, M. Tangermann: Listen, you are writing! Speeding up online spelling with a dynamic auditory BCI. Frontiers in Neuroscience 5 (2011) 112. [CrossRef] [PubMed] [Google Scholar]
- A. Jain, R. Bansal, A. Kumar, K.D. Singh: A comparative study of visual and auditory reaction times on the basis of gender and physical activity levels. International Journal of Applied and Basic Medical Research 5, 2 (2015) 124–127. [CrossRef] [PubMed] [Google Scholar]
- M. Schreuder, B. Blankertz, M. Tangermann: A new auditory multi-class brain-computer interface paradigm: Spatial hearing as an informative cue. PLoS One 5, 4 (2010) e9813. https://doi.org/10.1371/journal.pone.0009813. [CrossRef] [PubMed] [Google Scholar]
- A. Belitski, J. Farquhar, P. Desain: P300 audio-visual speller. Journal of Neural Engineering 8, 2 (2011) 025022. [CrossRef] [PubMed] [Google Scholar]
- L. Guého: Interface cerveau-machine basée sur des stimuli auditifs, Rapport de stage Master 2 Acoustique et Musicologie. Aix-Marseille Université, Orange Labs, 2022. [Google Scholar]
- S. Orts-Escolano, C. Rhemann, S. Fanello, W. Chang, A. Kowdle, Y. Degtyarev, D. Kim, P.L. Davidson, S. Khamis, M. Dou, V. Tankovich, C. Loop, Q. Cai, P.A. Chou, S. Mennicken, J. Valentin, V. Pradeep, S. Wang, S.B. Kang, P. Kohli, Y. Lutchyn, C. Keskin, S. Izadi: Holoportation: virtual 3D teleportation in real-time, in: Proceedings of the 29th Annual Symposium on User Interface Software and Technology, ACM, 2016, pp. 741–754. [CrossRef] [Google Scholar]
- B. Jones, Y. Zhang, P.N.Y. Wong, S. Rintel: Belonging there: VROOM-ing into the Uncanny Valley of XR telepresence, in: Proceedings of the ACM on Human-Computer Interaction, vol. 5, CSCW1, ACM, 2021. Article 59. https://doi.org/10.1145/3449133. [Google Scholar]
- KHRONOS: https://www.khronos.org. Accessed November 27, 2023. [Google Scholar]
- J. Choi, I. Jung, C.-Y. Kang: A brief review of sound energy harvesting. Nano Energy 56 (2019) 169–183. [CrossRef] [Google Scholar]
- S. Garrett: Thermoacoustic engines and refrigerators, in: CFA/VISHNO 2016 Congress, 2016. [Google Scholar]
Cite this article as: Nicol R. & Monfort J-Y. 2023. Acoustic research for telecoms: bridging the heritage to the future. Acta Acustica, 7, 64.
All Figures
Figure 1 History of ICT transformations along with the impact on acoustic research.
Figure 2 Anechoic chamber of Orange Labs in Lannion (© Orange DGCI).
Figure 3 Examples of HOA SMAs. From left to right: three prototypes developed by Orange (© J. Daniel), the Eigenmike® by mh acoustics [53] and the Zylia ZM-1 microphone [54].
Figure 4 HRTF measurement system installed in the anechoic chamber of Orange Labs in Lannion.
Figure 5 Telepresence Wall at France Telecom R&D in Issy-Les-Moulineaux, France (from [122]).
Figure 6 HRTF assessment based on the time taken by the listener to localize virtual sounds. From left to right: illustration of a participant during the experiment; average response time (τ) as a function of the HRTF set for 3 participants ([R19, R27, R45, R65, R82, R121]: synthetic HRTF sets reconstructed with increasing accuracy, [I]: individual HRTFs obtained by acoustic measurement, [NI1, NI2, NI3]: HRTF sets taken from other individuals); objective distance, based on the ISSD (Inter-Subject Spectral Difference [98]), between the individual set and the set under assessment (from [148]).
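For context, a hedged formulation of the ISSD used here as the objective distance (a sketch following the usual definition of the metric; the exact frequency range and HRTF pre-processing are those of [98] and may differ in detail): for two HRTF sets $A$ and $B$ measured over $N_d$ directions,
$$\mathrm{ISSD}(A,B)=\frac{1}{N_d}\sum_{d=1}^{N_d}\operatorname{Var}_{f}\!\left[\,20\log_{10}\lvert H_A(d,f)\rvert-20\log_{10}\lvert H_B(d,f)\rvert\,\right],$$
i.e. the dB difference spectrum is computed for each direction, its variance is taken across the spectral-cue frequency range (up to roughly 13 kHz), and the result is averaged over directions; lower values indicate more similar sets.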
Figure 7 Neuroimaging for perceptual comparison of binaural and stereophonic sounds. From left to right: EGI® sensor net, average ERPs (Event Related Potential) measured in the parieto-occipital region and the frontal region. Blue and purple curves correspond to deviant stimuli in the binaural and stereophonic condition respectively, while red and green curves correspond to standard stimuli in the binaural and stereophonic condition respectively (from [151]).
Figure 8 Prototypes of one-eighth SFMA. From left to right: 8-MEMS microphone array, 16-MEMS microphone array, example of an installation in a room (from [175]).