On the identification and assessment of underlying acoustic dimensions of soundscapes

The concept of soundscapes according to ISO 12913-1/-2/-3 proposes a descriptive framework based on a triangulation between the entities acoustic environment, person and context. While research on the person-related dimensions is well established, there is not yet complete agreement on the relevant indicators and dimensions for the pure description of acoustic environments. Therefore, this work attempts to identify acoustic dimensions that actually vary between different acoustic environments and thus can be used to characterize them. To this end, an exploratory, data-based approach was taken. A database of Ambisonics soundscape recordings (approx. 12.5 h) was first analyzed using a variety of signal-based acoustic indicators ($N_i$ = 326) within the categories loudness, quality, spaciousness and time. Multivariate statistical methods were then applied to identify compound and interpretable acoustic dimensions. The interpretation of the results reveals 8 independent dimensions, “Loudness”, “Directivity”, “Timbre”, “High-Frequency Timbre”, “Dynamic Range”, “High-Frequency Amplitude Modulation”, “Loudness Progression” and “Mid-High-Frequency Amplitude Modulation”, to be statistically relevant. These derived latent acoustic dimensions explain 48.76% of the observed total variance and form a physical basis for the description of acoustic environments. Although all baseline indicators were selected for perceptual reasons, validation must be done through appropriate listening tests in the future.


Introduction
Soundscape, as defined in ISO 12913-1 [1], is understood as a multidimensional framework whose evaluation is recommended by a triangulation between the aspects person, context and acoustic environment (AE) [2]. With the increase of information about each of these aspects, the description and interpretability of a specific soundscape gains validity. As soundscape itself is a multi-disciplinary concept, research in this area follows diverse approaches and paradigms. There is a considerable body of research on the human-centered approach of soundscape, for example the identification of fundamental emotional dimensions [3][4][5]. The dimensions for affective quality defined in [6], namely valence and arousal, were developed by means of factor or principal component analysis of a multitude of attribute scales [4,7].

In contrast, there is little agreement on the purely acoustical parameters, indicators and dimensions that describe the physical aspects of soundscapes in a discriminative way. Rather, the choice of parameters as indicators for individual hypotheses depends strongly on the respective discipline. Attempts to model sound and noise quality and perception, for example, often rely on A- and C-weighted sound pressure levels or conventional psychoacoustic measures such as loudness, sharpness, roughness and fluctuation [8]. An extensive review of models and underlying acoustic and psychoacoustic indicators can be found in [9]. At the same time, researchers agree that these and other known parameters are not sufficient to model emotional arousal and valence [5], also due to the lack of contextual and other non-acoustic indicators. A different application in computer science is the training of algorithms for extracting information from recordings of acoustic environments, such as in acoustic event detection (AED), acoustic event classification (AEC), acoustic scene classification (ASC) or acoustic scene analysis (ASA). Here, parameters like short-time spectrograms or Mel-frequency cepstral coefficients (MFCCs) are widely used. An attempt to cluster acoustic events on the basis of selected acoustic features for the mapping of urban sounds can be found in [10]. However, the general question of which parameters and parameter combinations actually describe an acoustic environment adequately has not yet been answered satisfactorily. An exemplary overview of the parameters considered for the respective research purposes can be found in Table 1.

In the case of annoyance or quality modeling of soundscapes, an important question would be whether the selected signal properties are indeed capable of sufficiently modeling the annoyance or whether, for example, a particular combination of parameters has unexpected effects. This leads to manifold approaches to find suitable parameters that can act as appropriate indicators for perceptual, cognitive or emotional reactions. This work therefore aims to support this research direction by collecting potential acoustic parameters on the basis of which soundscapes with distinguishable human-related properties can be evaluated. It follows the approach that was conceptually presented previously by the authors in [11]. To this end, this paper takes a step back and examines which acoustic parameters occur at all with some variance in soundscapes. This exploratory approach is motivated by the development of the affective qualities, where a multitude of attributes are aggregated to a small number of emotional dimensions. In a similar way, a multitude of acoustic indicators are taken into account to form underlying, latent acoustic dimensions. If these dimensions are seen as distinguishing characteristics of acoustic environments, a variety of applications can be pursued, such as the modeling of human-centered responses [12,13], the ecological validation of soundscape reproductions where certain acoustic properties are to be preserved [14,15] or the training of algorithms for automatically deriving information from soundscape recordings [16].

Indicators for acoustic assessment
For the description of an acoustic environment, distinct quantifiable indicators must be identified and selected. Since the description aims to assess human-centered perception, a-priori categories for acoustic indicators are derived from semantic description: quality (in the following referred to as Q), loudness (L), spaciousness (S) and time (T). The category quality must not be confused with valence but represents characteristics that help human beings to identify sound sources, such as information on timbre and spectral composition as well as short-time temporal succession. Loudness distinguishes whether an acoustic environment is perceived as loud or soft in volume; spaciousness represents location, distribution and envelopment of both the sound sources and indistinguishable background noise. The category time describes how the acoustic scene changes over time. Suitable indicators from literature sources (Tab. 1 and others) are assigned to these categories in the following listing. A detailed description, scientific sources and implementation of each indicator can be found in the Supplementary Material A.
Time: amplitude modulation (frequency and depth; periodic and stochastic); time series of all above indicators.
For this work, each indicator is calculated as a time series for overlapping frames of 100 ms each with a hop size of 50 ms to respect both the time-integrating behavior of the human auditory model [17] and the time-variance of acoustic scenes. It is recognized here that there may be acoustic events and psychoacoustic effects that are difficult to detect with this temporal resolution. At the same time, averaging through large analysis windows contributes to the increased robustness of the results against statistical and measurement noise. Furthermore, the majority of indicators are calculated frequency-dependent. For that, the broadband

The quality and loudness indicators may be calculated either from a monophonic pressure representation or from a binaural signal, while the spaciousness indicators require a binaural and a spherical harmonic (Ambisonics) signal representation of the three-dimensional sound field. The latter two representations incorporate spatial information of an acoustic environment such as the location of sound sources or the envelopment of sound. In order to maintain consistency and to reduce data complexity, all three representations stem from the same recording of a specific acoustic environment. For that, microphone array recordings are necessary that can be transformed into the spherical harmonic domain as is established in Ambisonics encoding and rendering [18]. The order of the Ambisonics recordings generally determines the spatial confidence. However, even first-order Ambisonics (FOA) recordings are suitable for the analysis in this work. The binaural representation is derived by convolution with appropriate head-related transfer functions (HRTF) [19] as is established in [20,21]. The monophonic sound pressure representation, on the other hand, is proportional to the 0th-order Ambisonics component [22].
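The framing described above (100 ms frames, 50 ms hop, i.e. 20 observations per second) can be illustrated with a minimal sketch. The sampling rate of 48 kHz, the function names and the simple RMS level used as a stand-in indicator are assumptions made for illustration only; the actual indicators are described in the Supplementary Material A.

```python
import numpy as np

def frame_signal(x, fs, frame_s=0.100, hop_s=0.050):
    """Cut a 1-D signal into overlapping frames (rows) of frame_s seconds."""
    frame_len = int(round(frame_s * fs))
    hop_len = int(round(hop_s * fs))
    n_frames = 1 + (len(x) - frame_len) // hop_len
    # Index matrix: one row of sample indices per frame
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    return x[idx]

def rms_level_db(frames, eps=1e-12):
    """Frame-wise RMS level in dB re full scale (illustrative indicator only)."""
    return 10.0 * np.log10(np.mean(frames ** 2, axis=1) + eps)

fs = 48_000                           # assumed sampling rate
t = np.arange(fs) / fs                # 1 s of signal
x = np.sin(2 * np.pi * 1000.0 * t)    # 1 kHz test tone, amplitude 1

frames = frame_signal(x, fs)
levels = rms_level_db(frames)         # one observation per 50 ms hop
```

Each row of `frames` yields one observation of an indicator time series; a 30 s excerpt analyzed this way produces roughly 600 observations per indicator.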

Determining underlying dimensions
The idea pursued in this work is that the multitude of indicators contains information describing the properties of an acoustic environment that are relevant when a human being perceives and contextualises the same environment. Just as humans can classify their environment acoustically on the basis of their two ear signals, a procedure is to be developed here that provides an abstract construct for the description and identification of acoustic environments on the basis of the indicators presented in the previous section. In other words, it is assumed that the observed indicators above are realizations of certain underlying acoustic dimensions that characterize an acoustic scene or environment. These assumptions allow the application of exploratory factor analysis (FA) as schematically depicted in Figure 1.

Similar to the related principal component analysis (PCA), FA can be used here to aggregate data variances (and thus information) by transforming the observed indicator time series from an original space into an optimized space of latent dimensions. The methodological differences between PCA and FA concern the perspective: while PCA assumes that the observed indicators constitute the ground truth, which in turn can be described by principal components, FA implies that the (hidden) latent factors constitute the ground truth and the observed indicators are more or less arbitrary realizations of it. For the sake of comparability, some taxonomies, measures and results of PCA are placed alongside those of FA in the following.

The operation itself to obtain the factor scores Y in the optimized space is realized by matrix multiplication as shown in equation (1),

$$Y = X \cdot L \qquad (1)$$

where X is the matrix of dimension [$N_o \times N_i$] of the original data ($N_o$: number of observations; $N_i$: number of indicators) and L is a specific loading matrix of dimension [$N_i \times N_f$] ($N_f$: number of factors). The loading matrix comprises the individual weights of each indicator into each factor. Summing the squared loadings down the rows, i.e. among indicators, yields the sum of square loadings (in PCA: eigenvalues of the covariance matrix) or explained variance of a certain factor. This measure indicates the weight of a particular jth factor, which is important when deciding which factors to retain. Dividing L by the respective explained variances yields the relative loading $L_{rel}$ (in PCA: eigenvectors of the covariance matrix), which includes the assignment of the indicators to the respective factors and represents the direction of the transformation.

In contrast to PCA, the factors in FA only express the common part of the variance of the observed indicators. In practice, this means that each indicator may exhibit portions of specific variance $\varepsilon_s$ as well as measurement noise $\varepsilon_n$, neither of which is included in the factors, as denoted in Figure 1 with $\varepsilon_i = \varepsilon_{s,i} + \varepsilon_{n,i}$. Hence, we allow the indicators to be imperfect realizations of the factors, which relaxes the necessary requirements of the indicators.
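The matrix operation for the factor scores and the sum of square loadings described above can be sketched numerically. The data and the loading matrix below are random placeholders, not values from this study, and the variable names are chosen for illustration. The last line relies on the standard fact that for z-standardized indicators the total variance equals the number of indicators.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_ind, n_fac = 1000, 6, 2       # toy sizes, far smaller than in the study

# Placeholder for z-standardized indicator observations, shape [N_o x N_i]
X = rng.standard_normal((n_obs, n_ind))
# Hypothetical loading matrix, shape [N_i x N_f]
L = rng.uniform(-1.0, 1.0, size=(n_ind, n_fac))

# Factor scores by matrix multiplication, shape [N_o x N_f]
Y = X @ L

# Sum of square loadings per factor (summing over indicators)
ssl = np.sum(L ** 2, axis=0)

# Share of total variance per factor, assuming standardized indicators
explained_ratio = ssl / n_ind
```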
In order to apply FA to indicators of different scales and units, preprocessing of the initial indicator vectors must be applied. For that, an interval range of expected values was defined for each indicator and scaling was applied accordingly to derive relative values within this interval. Since FA is only capable of identifying linear relationships, nonlinear indicators must also be treated accordingly. Ratio-scaled indicators with reference to frequency in Hz are converted to frequency in octaves relative to 10 Hz to account for the logarithmic behavior of auditory pitch perception. Conversions and expectation intervals for each indicator can be taken from the Supplementary Material A. Finally, a z-standardization was applied to each indicator, that is, removal of the mean and normalization to unit variance.

Pure FA produces mutually independent (uncorrelated) factors where the first factor includes maximum variance. However, this might result in a loading matrix that is difficult to interpret. In these cases, a further rotation of the loading matrix L aims for a simple structure with few high loadings and many low loadings. In this work, the orthogonal rotation method varimax was chosen to preserve uncorrelated factors while increasing interpretability.
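A minimal sketch of the preprocessing steps and of a varimax rotation follows. The helper names, the min-max form of the interval scaling and the toy loading matrix are assumptions for illustration; the actual intervals and conversions are given in the Supplementary Material A. The rotation uses a commonly used SVD-based iteration for the varimax criterion, which is not necessarily the implementation used in this work.

```python
import numpy as np

def hz_to_octaves(f_hz, f_ref=10.0):
    """Convert frequency in Hz to octaves relative to f_ref (10 Hz here)."""
    return np.log2(np.asarray(f_hz, dtype=float) / f_ref)

def scale_to_interval(x, lo, hi):
    """Relative value within an expectation interval [lo, hi] (assumed min-max form)."""
    return (np.asarray(x, dtype=float) - lo) / (hi - lo)

def zscore(X):
    """Column-wise removal of the mean and normalization to unit variance."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

def varimax(L, gamma=1.0, max_iter=100, tol=1e-8):
    """Orthogonal varimax rotation; returns rotated loadings and rotation matrix."""
    p, k = L.shape
    R = np.eye(k)
    crit = 0.0
    for _ in range(max_iter):
        Lr = L @ R
        grad = L.T @ (Lr ** 3 - (gamma / p) * Lr @ np.diag(np.sum(Lr ** 2, axis=0)))
        u, s, vt = np.linalg.svd(grad)
        R = u @ vt                      # nearest orthogonal matrix to the gradient
        crit_new = s.sum()
        if crit_new < crit * (1.0 + tol):
            break
        crit = crit_new
    return L @ R, R

# Toy usage with hypothetical values
octaves = hz_to_octaves([20.0, 10240.0])           # 1 and 10 octaves above 10 Hz
rel = scale_to_interval(50.0, 0.0, 100.0)          # 0.5
Z = zscore(np.random.default_rng(0).normal(5.0, 2.0, size=(200, 3)))

L0 = np.array([[0.7, 0.5], [0.6, 0.6], [0.8, 0.4],
               [0.5, -0.6], [0.6, -0.5], [0.4, -0.7]])
L_rot, R = varimax(L0)
```

Because the rotation matrix is orthogonal, the communality of each indicator (row-wise sum of squared loadings) is unchanged; only the distribution of the loadings across factors changes.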

Application
The following section describes how the previously presented methodology is applied to real observational data of acoustic environments. To make generalizable statements about underlying dimensions, the sample of AE recordings for developing the final loading matrix L must be chosen with care. This work focuses on acoustic environments that can potentially be part of soundscape research and comprises indoor and outdoor recordings of public places with and without human impact. The selection consists of publicly available recording databases in which Ambisonics or microphone array recordings are utilised. The databases listed in the following are used.
All in all, $N_o$ = 903,735 observations of 100 ms were analyzed for $N_i$ = 326 indicators. The multivariate methods PCA and FA, both with and without subsequent varimax rotation, were applied and performance metrics were compared. Figure 2 shows the cumulative explained variance portions of the four methods. In order to identify the most relevant factors or principal components respectively, parallel analysis was applied as well as the Kaiser criterion. The result of this relevance analysis can be found in Table 3. We recall that the goal of the multivariate methods here is to find latent dimensions that are manageable in number and interpretable at the same time. Thus we evaluate not only the mere number of relevant components but also their composition, i.e. which indicators contribute considerably to a certain dimension. We find that the varimax rotation leads, as intended, to fewer indicators with higher loadings contributing to the latent dimensions, which is why we consider this processing step beneficial. We can also observe that the parallel analysis assumes a lower number of components to be relevant, which aligns with expectations as well as with our aim. Overall, the first 8 varimax-rotated factors explain 49% of the total variance, which is generally just a moderate result but a reasonable starting point given a large input of $N_o$ = 903,735 observations with $N_i$ = 326 indicators.

An investigation of the composition of these 8 factors reveals their interpretability in terms of acoustical semantics. Table 4 lists the indicators that contribute to a certain factor. The indicators are sorted in descending order by their absolute relative loading, which is given in parentheses. Only those indicators are listed whose cumulative sum of squares describes at least half of the respective factor's variance and which thus characterize it.
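The two retention criteria mentioned above can be sketched as follows. The Kaiser criterion keeps components whose eigenvalue of the correlation matrix exceeds 1; parallel analysis keeps the leading components whose eigenvalues exceed those obtained from random data of the same size. The 95% quantile, the number of random draws and the synthetic two-factor data are assumptions for illustration and are not taken from this work.

```python
import numpy as np

def sorted_eigenvalues(X):
    """Eigenvalues of the correlation matrix in descending order."""
    return np.sort(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))[::-1]

def kaiser_criterion(X):
    """Number of components with eigenvalue greater than 1."""
    return int(np.sum(sorted_eigenvalues(X) > 1.0))

def parallel_analysis(X, n_iter=50, quantile=0.95, seed=0):
    """Number of leading components exceeding eigenvalues of random normal data."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    observed = sorted_eigenvalues(X)
    random_eigs = np.empty((n_iter, p))
    for k in range(n_iter):
        random_eigs[k] = sorted_eigenvalues(rng.standard_normal((n, p)))
    threshold = np.quantile(random_eigs, quantile, axis=0)
    exceeds = observed > threshold
    # Count components up to the first one that fails the comparison
    return p if exceeds.all() else int(np.argmin(exceeds))

# Synthetic data: 10 indicators driven by 2 latent factors plus noise
rng = np.random.default_rng(1)
F = rng.standard_normal((500, 2))
W = np.zeros((2, 10))
W[0, :5] = 1.0
W[1, 5:] = 1.0
X = F @ W + 0.5 * rng.standard_normal((500, 10))

n_kaiser = kaiser_criterion(X)
n_parallel = parallel_analysis(X)
```

In this clean synthetic case both criteria recover the two latent factors. On real indicator data the Kaiser criterion typically retains more components than parallel analysis, which matches the observation reported in Table 3.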

Results
The factors are sorted by decreasing amount of explained variance. It has to be mentioned that the amount of variance that can be explained with a certain factor depends strongly on the initial choice of indicators. It follows that the absolute amount of explained variance within the factors should not be over-interpreted; the indicator composition itself is more informative.
Factor 1 obviously describes the level or loudness of the soundscape recording. Indicators such as loudness (Zwicker), A-weighted SPL and loudness (LUFS) dominate this factor, especially within the frequency bands 3-5 (corresponding to $f_c$ = 250...1000 Hz).
Factor 3, which explains the second largest portion of variance, comprises the spherical directivity indices of the incoming sound field, representing information on whether the sound energy arrives from a certain direction or region or whether it surrounds the respective receiver position. Again, the mid-high frequency bands are prominent in discriminating the observations.
Factor 2 includes mainly spectral characteristics. That these indicators form a prominent factor is somewhat surprising from a statistical point of view, since they are not calculated in frequency bands and thus contribute only once to the total explained variance. It is also interesting that these indicators form a factor together with SPL and loudness indicators in the low-frequency bands 0 and 1.
Factor 5 includes high-frequency content such as SPL and loudness in the frequency bands 8 and 9 ($f_c$ = 8...16 kHz) but also sharpness, which is a measure of high-frequency spectral characteristics. Obviously, this higher frequency region is more or less independent of the general spectral timbre as described in the previous factor 2.
Factor 7 comprises loudness range indicators following the algorithms according to [26][27][28]. It has to be mentioned that this loudness range is usually calculated as a single value for a specific audio content (e.g. music or movie). Calculated as a time series of soundscape recordings, it describes the loudness range of the respective previous time interval from the start of the recording to the current timestamp. As a consequence, the observations depend on previous time periods and thus violate the requirements for FA. This behavior is also reflected statistically, as no other indicator (group) contributes to factor 7. Therefore, it may be necessary to omit this factor altogether from further analysis.
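The dependence on previous time periods can be made explicit with a simplified sketch of a running loudness range. It assumes a definition in the style of EBU Tech 3342 (absolute gate at -70 LUFS, relative gate 20 LU below the gated mean, range between the 10th and 95th percentile of short-term loudness values); whether this matches the algorithms in [26][27][28] in every detail is an assumption, and the input series is synthetic.

```python
import numpy as np

def loudness_range(st_lufs, abs_gate=-70.0, rel_gate=-20.0):
    """Simplified loudness range (LU) of a series of short-term loudness values."""
    st = np.asarray(st_lufs, dtype=float)
    st = st[st >= abs_gate]                      # absolute gating
    if st.size == 0:
        return 0.0
    # Energetic mean of the remaining values, used for relative gating
    mean_db = 10.0 * np.log10(np.mean(10.0 ** (st / 10.0)))
    st = st[st >= mean_db + rel_gate]
    low, high = np.percentile(st, [10, 95])
    return float(high - low)

def running_loudness_range(st_lufs):
    """Loudness range from the start of the recording up to each time step."""
    return np.array([loudness_range(st_lufs[:k + 1]) for k in range(len(st_lufs))])

# Synthetic short-term loudness: quiet first half, louder second half
st_lufs = np.concatenate([np.full(50, -40.0), np.full(50, -20.0)])
lra_series = running_loudness_range(st_lufs)
```

Every value of `lra_series` is a function of all earlier short-term values, so successive observations are not independent of the recording's history.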
A different analysis of the temporal behaviour of loudness or level can be observed in factor 6, which includes especially the modulation depth of the first three dominant periodic modulations (modDepthP) as well as the remaining stochastic modulation (modDepthS) for the high-frequency range of bands 8 and 9 ($f_c$ = 8...16 kHz). This modulation behaviour obviously differs from that in the mid-high range (bands 5 and 6) that we can observe in factor 8.
Factor 4 mainly summarizes loudness indicators of short-time averaged LUFS. Since these indicators use a moving time window of 3 s, the resulting factor can be interpreted as loudness progression. It is noteworthy that this specific temporal characteristic is statistically independent of the general loudness in factor 1.
In summary, we observe that the first 8 relevant factors can be interpreted quite well in terms of the dominant indicators that contribute to them. Table 5 lists the semantic descriptors that are proposed from the findings above.
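The rule used to determine the dominant indicators of a factor (descending absolute loading, cut off once the cumulative sum of squares covers at least half of the factor's variance) can be sketched as follows. The indicator names and loading values are hypothetical and serve only to show the mechanics.

```python
import numpy as np

def dominant_indicators(loadings, names, share=0.5):
    """Indicators, sorted by |loading|, that jointly cover `share` of the
    factor's sum of squared loadings."""
    l = np.asarray(loadings, dtype=float)
    order = np.argsort(-np.abs(l), kind="stable")
    cumulative = np.cumsum(l[order] ** 2) / np.sum(l ** 2)
    # First position at which the cumulative share reaches the threshold
    n_keep = int(np.searchsorted(cumulative, share)) + 1
    return [(names[i], float(l[i])) for i in order[:n_keep]]

# Hypothetical loadings of one factor
names = ["indicator_a", "indicator_b", "indicator_c", "indicator_d"]
loadings = [0.55, -0.50, 0.40, 0.30]

selected = dominant_indicators(loadings, names)
```

Here the first indicator alone covers about 38% of the squared loadings, so the second one is needed to pass the 50% threshold.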

Plausibility considerations
As a first step towards validation of the identified acoustic dimensions, we show a comparative example at this point. It comprises a sample of 19 excerpts of 30 seconds each from the soundscape recording database, as listed in Table 6. The sample includes examples from each recording database for a range of acoustic environment classes that potentially contain the three sound source classes according to ISO 12913-2, namely sounds of technology, sounds of nature and sounds of human beings. The excerpts were selected subjectively by the authors aiming for two criteria: (i) a homogeneous listening impression throughout a sample, to allow a semantic and statistical description that is valid for the entire excerpt, and (ii) samples that potentially have similarities and differences within the identified acoustic features. The acoustic indicators were calculated and the resulting factor scores were derived according to equation (1). The distributions of factor scores for these samples can be found in Figure 3 as boxplots of median, 25% and 75% quantiles, where outliers are omitted for better visibility. For comparison, the distributions of all 109 analyzed soundscape recordings are listed in the Supplementary Material B.
The samples' factor score distributions were analyzed with regard to normality and homoscedasticity, which could not be confirmed in every case, which is why non-parametric statistical methods were applied. A Kruskal-Wallis test on ranks indicates significant differences among the samples within all acoustic dimensions (H > 9500, p < 0.01). Subsequently, pairwise Dunn's post-hoc tests with Bonferroni adjustment were performed, comparing all 19 samples with each other for each dimension. Whether each comparison pair differs significantly can be seen in Figure 4. It is noteworthy that the majority of these comparisons exhibit strongly significant differences with p < 0.01 (**, red tiles). This result might be influenced by the relatively large number of observations (30 s × 20 observations/s) and should at this point only describe the difference from a statistical point of view.

An exemplary comparison of those soundscape excerpts that are specifically quiet in loudness, namely the library (ID: 1), park (ID: 9) and woodland (ID: 14) scenarios, shows similar and different properties with respect to their acoustic dimensions. The distributions of the dimension Loudness are relatively low (cf. Fig. 3, top left) and the differences between the soundscape excerpts are not significant (cf. Fig. 4, top left). The other acoustic dimensions of these excerpts show similar but still significantly different distributions. This underlines the expected discriminating characteristics of the observed soundscapes: even though the loudness dimension seems to be similar, other dimensions show significant differences.
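The statistical procedure described above can be sketched for one dimension. The Kruskal-Wallis test uses `scipy.stats.kruskal`; Dunn's test is written out by hand here in its basic form without tie correction, which is a simplification. The three synthetic groups of 600 values (30 s × 20 observations/s) are placeholders for factor scores of excerpts and are not data from this study.

```python
from itertools import combinations

import numpy as np
from scipy import stats

def dunn_bonferroni(groups):
    """Pairwise Dunn's test (no tie correction) with Bonferroni adjustment.
    Returns a symmetric matrix of adjusted p-values."""
    sizes = np.array([len(g) for g in groups])
    ranks = stats.rankdata(np.concatenate(groups))
    n_total = ranks.size
    bounds = np.concatenate(([0], np.cumsum(sizes)))
    mean_ranks = np.array([ranks[bounds[i]:bounds[i + 1]].mean()
                           for i in range(len(groups))])
    pairs = list(combinations(range(len(groups)), 2))
    p_adj = np.ones((len(groups), len(groups)))
    for i, j in pairs:
        se = np.sqrt(n_total * (n_total + 1) / 12.0
                     * (1.0 / sizes[i] + 1.0 / sizes[j]))
        z = (mean_ranks[i] - mean_ranks[j]) / se
        p = 2.0 * stats.norm.sf(abs(z))
        p_adj[i, j] = p_adj[j, i] = min(1.0, p * len(pairs))
    return p_adj

# Synthetic factor scores of three excerpts; the third is clearly shifted
rng = np.random.default_rng(0)
groups = [rng.normal(0.0, 1.0, 600),
          rng.normal(0.0, 1.0, 600),
          rng.normal(3.0, 1.0, 600)]

h_stat, p_kw = stats.kruskal(*groups)
p_pairwise = dunn_bonferroni(groups)
```

With several hundred observations per excerpt, even small shifts between distributions become statistically significant, which is in line with the caution expressed above about interpreting the large number of significant pairs.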

Discussion
With the presented application of multivariate methods to a wide range of acoustic indicators of soundscape recordings, it is possible to extract statistically independent factors that serve as underlying acoustic dimensions. It could also be shown that an interpretation of these factors based on the indicator composition was feasible in terms of finding appropriate semantic descriptors. These descriptors can generally be assigned to the a-priori categories loudness, quality, spaciousness and time and thus confirm the assumption that acoustic environments can be described with these terms. The fact that each of these categories is represented by more than one factor (e.g. loudness: factors 1 and 4; quality/timbre: factors 2 and 5; time/modulation: factors 6 and 8) can be interpreted to mean that the selection of indicators is crucial when physical characteristics of acoustic environments are to be described. In other words, if different acoustic indicators are chosen for the multivariate analysis, different factor compositions may be observed. In order to validate the derived underlying acoustic dimensions, discriminative investigations must be carried out in the future. These include both the statistical differentiation of specific soundscape recordings and the perceptual evaluation of whether these dimensions are also taken into account when human subjects characterize acoustic environments. Therefore, this paper should serve as an invitation to evaluate and refine the proposed acoustic dimensions.

Conclusion
The presented paper discusses the need for suitable acoustic descriptors for characterizing the physical properties of soundscapes. For that, an approach was pursued that is comparable to the identification of semantic dimensions of perceptual assessment, namely the application of multivariate statistical methods to reveal underlying constructs of observable variables. In this work, these methods were adapted to acoustic signal indicators. In total, 903,735 short-term observations of 326 indicators within 109 recordings of soundscapes were fed into factor analysis (FA) and relevant factors were derived. With this set of eight underlying dimensions, 49% of the overall observed variance could be explained and interpretable semantic descriptors could be found. The presented approach allows the description of acoustic environments in an efficient and comprehensive way. Various areas of application may benefit from this descriptive set of acoustic dimensions, e.g. computer-based applications such as acoustic scene analysis and classification or perception-based applications such as soundscape quality estimation or annoyance modelling. Furthermore, if this approach receives confirmation, it can serve as a comparative benchmark method for soundscape description and analysis.

The derived results are limited by certain implications and conditions. First, the initial choice of indicators may influence the statistical outcome, especially the amount of explained variance. Second, a perceptual validation by means of appropriate listening tests is still pending. Appropriate listening tests are currently being conducted and results will be published in the near future. Third, the influence of signal analysis parametrization, including time and frequency resolution, Ambisonics decoding scheme and binaural convolution, needs to be quantified, especially when merging statistical and perceptual results.

Figure 1. Concept of factor analysis with loadings $l_{ij}$ and unique variances $\varepsilon_i$.

Figure 2. Explained variance ratio in % for PCA and FA, both with and without varimax rotation.

Figure 3. Distribution of factor scores amongst the soundscape recordings for each relevant factor.

Table 1. Exemplary overview on research using different sets of signal parameters.
($N_o$: number of observations; $N_i$: number of indicators) of the original data and L a specific loading matrix of dimension [$N_i \times N_f$] ($N_f$: number of factors). The loading matrix comprises the individual weights of each indicator into each factor. The sum down the rows, i.e. among indicators, yields the sum of square loadings (in PCA: eigenvalues of the covariance matrix) or explained variance of a certain factor.

Table 2. Frequency limits in Hz of analysis bands.

Table 3. Number $N_{f,r}$ of relevant factors (FA) or principal components (PCA) according to Kaiser criterion and parallel analysis.

Table 4. Indicator composition of the 8 most relevant factors. Trailing index numbers denote the respective frequency band ID as listed in Table 2. Numbers in parentheses indicate the respective loadings.

Table 5. Summarized semantic descriptors for the first eight relevant factors.

Table 6. Sample draw of soundscape excerpts for exemplary pairwise comparison.